R Data Visualisation 1 - Objective of workshop

To create scatter and bar plot visualisations using the ggplot2 package.

What this workshop will cover

In this workshop, the aim is to cover how to use the ggplot2 package. We will be covering:

  • An introduction to the ggplot2 package
  • How to make scatter plots with ggplot2
  • How to make bar plots with ggplot2
  • How to change colours and other features in your visualisations

Introduction

Data visualisation is a way of looking at your data using graphics, which gives you a different perspective on your data.

There are a lot of different options for data visualisation with R. You can use the visualisation tools that come with R, ggplot and all its extensions, or, for interactive visualisations, the plotly library.

In this data visualisation series we will mainly focus on ggplot, as well as plotly. While the visualisation tools that come with R are useful, ggplot and plotly are generally easier to use and make great visualisations with. For this tutorial we will be using the following packages: ggplot2, dplyr, readr, janitor, RColorBrewer, and forcats. Run the code below to install the packages if you don’t have them installed already.

# install packages
install <- c("ggplot2", "dplyr", "readr",
             "janitor", "RColorBrewer",
             "forcats")

install.packages(install, Ncpus = 6)

Then we need to load them into our session. Run the code chunk below to load all the libraries you will need.

# load packages
library(ggplot2)
library(dplyr)
library(readr)
library(janitor)
library(RColorBrewer)
library(forcats)

What is ggplot and how does it work?

ggplot2 is a package for producing graphics that works by combining independent components when making graphs, known as layers. This makes ggplot2 both versatile and powerful; you are not limited by a set of options but instead can make novel graphics to suit your needs.

It is also important to note that ggplot can only use data frames. If your data is in another format you will need to transform it into a data frame in order to use ggplot.
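As a minimal sketch of that conversion, assuming your data starts out as a matrix (the heights matrix below is made up for illustration and is not part of the workshop data), the base R function as.data.frame() turns it into something ggplot can use.

```r
# a made-up matrix, purely for illustration
heights <- matrix(c(150, 160, 170, 50, 60, 70),
                  ncol = 2,
                  dimnames = list(NULL, c("height", "weight")))

# ggplot needs a data frame, so convert the matrix first
heights_df <- as.data.frame(heights)

# check the conversion worked and the column names were kept
is.data.frame(heights_df)
names(heights_df)
```

Other formats, such as lists or time series objects, may need a different conversion step, so treat this as one example rather than a general recipe.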

In order to understand how the layers work we will first load in some data for our examples. We will use data from the Pokémon games, which was web scraped from https://pokemondb.net/pokedex/all.

# load and clean names
pokemon <- read_csv("https://raw.githubusercontent.com/andrewmoles2/webScraping/main/R/data/pokemon.csv") %>%
  clean_names()
# review data
pokemon %>%
  glimpse()
## Rows: 952
## Columns: 13
## $ number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ name       <chr> "Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Charmele…
## $ type1      <chr> "Grass", "Grass", "Grass", "Fire", "Fire", "Fire", "Water",…
## $ type2      <chr> "Poison", "Poison", "Poison", NA, NA, "Flying", NA, NA, NA,…
## $ total      <dbl> 318, 405, 525, 309, 405, 534, 314, 405, 530, 195, 205, 395,…
## $ hp         <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45, 65,…
## $ attack     <dbl> 49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25, 90,…
## $ defense    <dbl> 49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 50, 40…
## $ sp_atk     <dbl> 65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, 25, 4…
## $ sp_def     <dbl> 65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, 25, 8…
## $ speed      <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 35, 75…
## $ legendary  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ generation <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

The syntax for ggplot has three key components: the ggplot function call (ggplot()), the aesthetics (set with aes()), and the geometry (the geom_ functions), which determines the type of plot, for example scatter, bar, or line. The next three code chunks break this down.

# call ggplot2 and add data
ggplot(pokemon)

Notice we just get a grey box. We have loaded our data into ggplot but not much else! Now let's add the aesthetics and see what happens.

To add aesthetics we use the aes() function within the ggplot() function, and specify what our x and y axes will be with column names from our data, sp_atk and sp_def in this case.

# add aesthetics
ggplot(pokemon, aes(x = sp_atk, y = sp_def))

It is starting to look more like a visualisation now. We can see the x and y axis labels, but we still have no data points showing. We have to add a geometry for that to happen. Notice the syntax here: we use the + operator to add a geometry to ggplot, which in this case is geom_point(), which makes scatter plots. All geometry functions start with geom_ and end with the type of geometry, such as point, bar, or line.

# pick which geometry
ggplot(pokemon, aes(x = sp_atk, y = sp_def)) +
  geom_point()

This is the fundamental concept of ggplot: you construct your visualisations in layers, adding geometry layers and other features as you go.
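To make the idea of layers a little more concrete, here is a small sketch using a made-up data frame (toy is an assumption for illustration, so the example runs without downloading anything). It adds two geometry layers to the same plot and then uses layer_data() from ggplot2 to look at the data behind each layer.

```r
library(ggplot2)

# a tiny made-up data frame
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# layer 1 draws the points, layer 2 draws a line through the same points
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_line()

p

# layer_data() returns the data ggplot computed for a given layer
layer_data(p, 1)  # the point layer
layer_data(p, 2)  # the line layer
```

Each geom_ function you add with + becomes its own layer, drawn in the order you add them.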

What is ggplot exercise

Using the pokemon data, make a scatter plot with hp on the x axis and speed on the y axis.

# your code here

Scatter plots

Scatter plots are for displaying the relationship between two numeric (or quantitative) variables. For each data point, the value of its first variable is represented on the X axis and the value of the second on the Y axis.

To make a scatter plot with ggplot2 we use the geom_point() function like you just saw. For a conventional scatter plot, the variables on both the X and Y axes should be numeric.

The plot we just made in the example is okay but it could do with some improving. There are quite a few different ways to change the appearance of a visualisation, so let's go through them.

The first thing we will look at is adding some colour! There are a few options for adding colours to your plots. You can add the name, such as red, or you can use a hex code, or you can use a pre-defined palette. To add colour to a scatter plot we use the colour = argument.

# colour of points
ggplot(pokemon, aes(x = hp, y = speed)) +
  geom_point(colour = "orange")

To colour your points by a group (or factor) we have to add the colour argument into the aes() function. This allows us to have different colours for different groups, which makes the plot more informative.

In the below example, our data is coloured by whether a pokemon is classified as legendary or not.

# colour by factor
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point()

We get the default ggplot colours, which are okay. There are a few different ways of changing the colours, and all of them use a scale_ function in a slightly different way. In the two examples below we have changed the colours using the RColorBrewer package and have set the colours manually.

RColorBrewer comes with a set of palettes for different situations. You can view them by following this link: https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html. To use these palettes with ggplot we use the scale_colour_brewer() function with an argument for which palette we want to use; in this example we are using Set1.

library(RColorBrewer)
# adjusting colour by factor using RColorBrewer
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")

To make a manual palette, you first make a vector with your colours. To do so it is useful to use a colour picker such as http://tristen.ca/hcl-picker/#/hlc/6/1/15534C/E2E062 or https://coolors.co/. You copy the hex code (a # followed by 6 numbers or letters) and paste it into your vector, as you can see in the manual_pal vector below. To add the colours we use the scale_colour_manual() function, and set the values to our manual palette.

# adjusting colour by factor using manual palette
manual_pal <- c("#90C0F8", "#EA964E")

ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point() +
  scale_colour_manual(values = manual_pal)

It is sometimes helpful to view the palette before using it. We can use the scales package for this, which is installed when you install ggplot2. We provide the show_col() function with the palette we want to view and it returns a grid view of the colours. In the example we look at Set1 from RColorBrewer and the manual palette we just used.

# load scales
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
# view palettes
show_col(RColorBrewer::brewer.pal(n = 8, name = "Set1"))

show_col(manual_pal)

As well as changing the colour of the points, you can change their shape, size, and transparency (alpha). Just like with colour, we can define the size, shape or transparency either in the aes() function or in a geom_ function. By adding them to the geom_ function we manually change them. If we use them in aes() we have to associate the size/shape/alpha with a variable.

See the below example, first we manually set the size and alpha. In the second example we set the size to be defined by the total column in our pokemon data, and manually set the alpha.

# manually set size and alpha
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point(size = 5, alpha = 0.6) +
  scale_colour_brewer(palette = "Set1")

# manually set alpha, size by total
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6) +
  scale_colour_brewer(palette = "Set1")
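A common slip with the setting-versus-mapping distinction is putting a fixed colour inside aes(). The sketch below uses a small made-up data frame (toy is an assumption for illustration) to show the difference: outside aes() the colour is applied as given, while inside aes() the text "orange" is treated as a group with one level, so ggplot picks a colour from its default palette and adds a legend.

```r
library(ggplot2)

# a tiny made-up data frame
toy <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2))

# setting: colour is outside aes(), so every point is drawn in orange
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(colour = "orange", size = 4)

# mapping: colour is inside aes(), so "orange" is treated as a one-level
# group and ggplot chooses the colour from its default palette instead
p_map <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(colour = "orange"), size = 4)

p_set
p_map
```

If a plot ignores the colour, size, or alpha you asked for and shows an unexpected legend, checking whether the value ended up inside aes() is a good first step.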

To manually change the shape and replace the circles, we give the shape argument a number. Each number represents a shape, letter, or number; by default ggplot uses shape number 19. We can change the shape to a square for example by using the number 15.

# default shape number
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6, shape = 19) +
  scale_colour_brewer(palette = "Set1")

# shape number for squares
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6, shape = 15) +
  scale_colour_brewer(palette = "Set1")

View the image with the visual markdown editor to see what number represents what shape, letter, or number.
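If the image is not available to you, a chart of the shape numbers can be drawn with ggplot itself. This is a sketch that covers only the 26 standard point shapes (0 to 25); the shapes data frame and its grid layout are made up for illustration. It relies on scale_shape_identity() so that each number is used directly as the shape.

```r
library(ggplot2)

# one row per shape number, arranged on a grid 7 columns wide
shapes <- data.frame(
  shape = 0:25,
  col = 0:25 %% 7,
  row = -(0:25 %/% 7)
)

# draw each shape with its number underneath
shape_chart <- ggplot(shapes, aes(x = col, y = row)) +
  geom_point(aes(shape = shape), size = 5, fill = "orange") +
  geom_text(aes(label = shape), nudge_y = -0.3, size = 3) +
  scale_shape_identity() +
  theme_void()

shape_chart
```

Shapes 21 to 25 have both an outline (colour) and an inside (fill), which is why only those appear orange in the chart.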

Finally we can add a title and save our plot! We’ve done two things in order to achieve this. To add a title, and change axis labels, we have used the labs() function. We add arguments for what we want to change, such as title = "Pokemon Hit Points vs Speed". To change the legend labels we use colour and size, as we used these to define our legend in the aes() function.

To save the plot we assign our code to a variable, then we use the ggsave() function, which requires what you want to call the file and the file extension (e.g. plot.PNG or plot.JPG), then the ggplot object we created. Run the example below, and you should get a hp_vs_speed.PNG file where your Rmd file is saved. You can also adjust the size of the image saved using the width and height arguments.

# save plot to a variable
hp_vs_speed <- ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6, shape = 15) +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Pokemon Hit Points vs Speed",
       subtitle = "Taken from pokemondb.net",
       x = "Hit Points",
       y = "Speed",
       colour = "Legendary pokemon?",
       size = "Total stats")

hp_vs_speed

# save plot
ggsave("hp_vs_speed.PNG", hp_vs_speed)
## Saving 7 x 5 in image
# save with defined width and height
ggsave("hp_vs_speed.PNG", hp_vs_speed,
       width = 7, height = 4.5)
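ggsave() also has units and dpi arguments, which can be handy when a journal or report asks for a specific size and resolution. The sketch below is one way to use them; the toy data, the temporary file, and the chosen size are all assumptions for illustration.

```r
library(ggplot2)

# a tiny made-up plot to save
toy <- data.frame(x = c(1, 2, 3), y = c(2, 1, 3))
toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# write to a temporary file so nothing in your project folder is overwritten
out_file <- tempfile(fileext = ".png")

# size given in centimetres, saved at 300 dots per inch
ggsave(out_file, toy_plot,
       width = 16, height = 10, units = "cm", dpi = 300)

# confirm the file was written
file.exists(out_file)
```

In your own work you would replace out_file with a file name such as the hp_vs_speed.PNG used above.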

Scatter plots exercise

For the exercises in this workshop we will use data from the Olympics that includes all Olympic games from 1896 through to 2016. More information on the dataset can be found here https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md. Run the code provided to load the libraries and data into R.

We will make two scatter plots from the Olympics data. For both plots we will use dplyr to filter the information we are interested in, which has been done for you in this exercise.

  1. Using the provided scatter_plot1 data, make a scatter plot of Olympic gymnasts' heights (x axis) and weights (y axis).
  • Change the colour and shape arguments to tell us what sex the gymnasts are.
  • Change the colour palette by making a manual one or using RColorBrewer.
  • Be sure to give your plot a title, and save your plot.
  2. Using the provided scatter_plot2 data, make a scatter plot of the age (y axis) of gymnastic medal winners by year (x axis).
  • Colour your plot by medal by making a manual colour palette. hint: the hex codes for gold, silver and bronze are: "#FFD700", "#C0C0C0", "#CD7F32"
  • Use shape to tell us what sex the gymnasts were.
  • Be sure to give your plot a title, and save your plot.
# make sure libraries are loaded
library(readr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)

# load in data
olympics <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, team, noc, games, season, city, sport, event, medal
## dbl  (5): id, age, height, weight, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
olympics %>% glimpse()
## Rows: 271,116
## Columns: 15
## $ id     <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
## $ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
## $ age    <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
## $ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
## $ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
## $ team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
## $ noc    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
## $ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
## $ sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
## $ event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
## $ medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
# data cleaning for first scatter plot
scatter_plot1 <- olympics %>%
  filter(sport == "Gymnastics")

# data cleaning for second scatter plot
scatter_plot2 <- olympics %>%
  filter(sport == "Gymnastics") %>%
  filter(!is.na(medal)) %>%
  mutate(medal = factor(medal, levels = c("Gold", "Silver", "Bronze")))

# your code here

Quirks of ggplot2

There are a few quirks to be aware of when using ggplot2 and you’ll see a few of them when you look for examples online. In order to aid with this, we can have a look at a few of them!

The first quirk is piping data into ggplot, where you do not need to add your data into the ggplot() function as it is piped in. The main advantage of this approach is you can string together some data cleaning and then pipe the results straight into ggplot.

# piping data into ggplot
pokemon %>%
  ggplot(aes(x = sp_atk, y = sp_def)) +
  geom_point()

# piping with filter
pokemon %>%
  filter(type1 == "Fire") %>%
  ggplot(aes(x = sp_atk, y = sp_def)) +
  geom_point()

The second quirk is adding aesthetics into a geom_ function rather than the ggplot() function.

# adding aesthetics into the geom_ call
ggplot(pokemon) +
  geom_point(aes(x = sp_atk, y = sp_def))

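The third quirk is that you can also add the data into the geom_ function. When doing so you have to name the argument with data =, otherwise you will get an error, because the first argument of a geom_ function is the mapping rather than the data.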

# adding data and aesthetics into the geom_ call
ggplot() +
  geom_point(data = pokemon, aes(x = sp_atk, y = sp_def))
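To see what happens when the argument is not named, the sketch below catches the error with tryCatch() rather than letting it stop the script. The toy data frame is made up for illustration, and the exact wording of the error message depends on your ggplot2 version, so it is printed rather than quoted here.

```r
library(ggplot2)

# a tiny made-up data frame
toy <- data.frame(x = c(1, 2, 3), y = c(2, 1, 3))

# without data =, the data frame is passed where the mapping is expected
result <- tryCatch(
  ggplot() + geom_point(toy, aes(x = x, y = y)),
  error = function(e) e
)

# print the error message ggplot gave us
conditionMessage(result)

# naming the argument fixes it
ggplot() +
  geom_point(data = toy, aes(x = x, y = y))
```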

The fourth quirk relates to the second and third, in that you can add aesthetics into a geom_ function more than once. You might occasionally come across this for more complex visualisations.

In the example we will add the average of our x and y variables. First we make a summary table that has the averages of both axes, using summarise() from dplyr. Then we add two geom_point() functions, one with the pokemon data, and one with our summary table data.

# why adding aesthetics into the geom_ call
# calculate mean of sp_atk and sp_def
avg_sp <- pokemon %>%
  summarise(
    avg_sp_atk = mean(sp_atk, na.rm = TRUE),
    avg_sp_def = mean(sp_def, na.rm = TRUE))

avg_sp
## # A tibble: 1 × 2
##   avg_sp_atk avg_sp_def
##        <dbl>      <dbl>
## 1       71.3       70.7
# add average sp_atk and sp_def as black point
ggplot() +
  geom_point(data = pokemon, 
             aes(x = sp_atk, y = sp_def), 
             colour = "orange",
             size = 2.5) +
  geom_point(data = avg_sp, 
             aes(x = avg_sp_atk, y = avg_sp_def),
             size = 2.5)

The last quirk we will look at is adding to a ggplot visualisation after you have assigned it a name. This is very common in tutorials and on Stack Overflow. A good use of this is to build a base of the x and y you want to use and test out different geometries.

# saving plot then adding to it
p <- ggplot(pokemon, aes(x = sp_atk, y = sp_def))

p

p + geom_point()

p + geom_line()

Quirks of ggplot2 exercise

Make a visualisation of USA athletes' ages vs heights, showing the difference between the genders using colour. When making your visualisation try to:

  • Pipe the olympics data to a filter function and select all USA athletes
  • Pipe to a ggplot function
  • Add a geom_point function and add the aesthetics there rather than in ggplot()
# your code here

Bar plots with counts

Bar plots are used to show relationships between a numerical and categorical variable. The categorical variable is usually on the x axis, and the y axis is usually a frequency count.

By default, bar plots with ggplot only require an x or y axis. From that they make a frequency count of that variable. See the example below. First we use ggplot to make a bar plot to count the number of pokemon added in each generation. Then we do the same thing with dplyr to make an aggregate table. In effect, ggplot is building this aggregate table and making it into a plot for us!

It is important to make sure your x axis in a bar plot is a factor, as this helps ggplot to order the axis as you expect.
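As a small illustration of why this matters, assume the categories were stored as text (the gen_chr vector below is made up). Text is sorted alphabetically, so "10" comes before "2", whereas a factor keeps whatever level order you give it.

```r
# made-up category labels stored as text
gen_chr <- c("1", "2", "10", "2", "10")

# as plain text, "10" sorts before "2"
sort(unique(gen_chr))

# as a factor with explicit levels, the order is the one we choose
gen_fct <- factor(gen_chr, levels = c("1", "2", "10"))
levels(gen_fct)
```

ggplot uses the factor levels to order a discrete axis, so setting the levels is how you control the order of the bars.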

# make generation a factor
pokemon$generation <- factor(pokemon$generation)

# default bar plot
ggplot(pokemon, aes(x = generation)) +
  geom_bar()

# dplyr aggregate equivalent
pokemon %>%
  count(generation)
## # A tibble: 8 × 2
##   generation     n
##   <fct>      <int>
## 1 1            151
## 2 2             99
## 3 3            141
## 4 4            115
## 5 5            165
## 6 6             84
## 7 7             99
## 8 8             98

To add colour to your bar plot we use the fill argument rather than colour. This can be confusing, and sometimes if you forget, just try both till the colours look right! To add our fill manually we add the fill command to our geom_bar() function.

# manually add fill colour
ggplot(pokemon, aes(generation)) +
  geom_bar(fill = "purple")

Just like with the scatter plot, we can colour our plot by a variable by putting the fill argument within the aes() function. The below example also shows the equivalent when doing aggregation using dplyr.

# bar plot with colour by variable
ggplot(pokemon, aes(x = generation, fill = legendary)) +
  geom_bar()

# dplyr aggregate equivalent
pokemon %>%
  count(generation, legendary)
## # A tibble: 16 × 3
##    generation legendary     n
##    <fct>      <lgl>     <int>
##  1 1          FALSE       146
##  2 1          TRUE          5
##  3 2          FALSE        94
##  4 2          TRUE          5
##  5 3          FALSE       128
##  6 3          TRUE         13
##  7 4          FALSE       100
##  8 4          TRUE         15
##  9 5          FALSE       145
## 10 5          TRUE         20
## 11 6          FALSE        75
## 12 6          TRUE          9
## 13 7          FALSE        69
## 14 7          TRUE         30
## 15 8          FALSE        82
## 16 8          TRUE         16

Notice in the above example that the bars were stacked on top of each other by default. We have two other options for changing this: a dodge setting (bars sit next to each other) or a fill setting (stacked and standardised). To change this we use the position argument within geom_bar().

# dodge bars
ggplot(pokemon, aes(x = generation, fill = legendary)) +
  geom_bar(position = "dodge")

# filled bars
ggplot(pokemon, aes(x = generation, fill = legendary)) +
  geom_bar(position = "fill")
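The fill position shows proportions rather than counts. A rough dplyr equivalent of what is being plotted is sketched below, using a made-up data frame (toy and its values are assumptions for illustration): count each combination, then divide by the total within each x axis group.

```r
library(dplyr)

# a tiny made-up data frame
toy <- data.frame(
  generation = c("1", "1", "1", "1", "2", "2"),
  legendary  = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# count each combination, then convert to a proportion within generation
fill_props <- toy %>%
  count(generation, legendary) %>%
  group_by(generation) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

fill_props
```

Within each generation the proportions add up to 1, which is why every bar has the same height when position = "fill" is used.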

A useful thing to change with bar plots is to flip your coordinates. This is particularly useful if your x axis contains text. In the example below we will use the type1 variable as our x axis to see the difference. When we don’t flip the coordinates, the x axis is hard to read.

ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar()

ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip()

To change our colours we use the scale_fill_ function. This is very similar to what we did with scatter plots except we are using fill this time, rather than colour.

# change fill with RColorBrewer
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_brewer(palette = "Set1")

# change fill with manual palette
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)

Currently our plots have the default ggplot theme which has a grey background. We can change this by setting a new theme. To do so you use theme_ and select a theme which works best.

# change theme to black and white
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal) +
  theme_bw()

# change theme to dark
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal) +
  theme_dark()

Adding a theme to each plot can be tiring, so instead you can set a theme for all your plots by using the theme_set() function. Usually you set the theme before you make any of your visualisations. Now we have changed the theme to black and white, all our plots from now on will have a black and white theme.

# set global theme
theme_set(theme_bw())

# see result
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)

It is often helpful to arrange the values by their rank or size. There are options to do this with base R, but the forcats library from the tidyverse makes arranging and ordering factors very straightforward.

We will use the fct_infreq() function, which means factors in frequency, in effect ordering our factors by the frequency with which they appear. There are two approaches. First we use the fct_infreq() function within ggplot, or second we arrange our factor outside ggplot. Outside of ggplot is usually better as you have more control and it makes your ggplot code easier to read.

# load forcats
library(forcats)

# arrange by frequency within ggplot
ggplot(pokemon, aes(x = fct_infreq(type1), fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)

# arrange by frequency outside ggplot
pokemon$type1 <- fct_infreq(pokemon$type1)

ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)

We can also reverse the ordering by putting our fct_infreq() function inside a fct_rev() function (stands for factor reverse).

# arrange by frequency (descending)
pokemon$type1 <- fct_rev(fct_infreq(pokemon$type1))

ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)
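It can help to look at the factor levels directly to see what these two functions are doing. The sketch below uses a short made-up vector (types is an assumption for illustration, not the pokemon data).

```r
library(forcats)

# made-up values: Water appears 3 times, Fire twice, Grass once
types <- c("Water", "Fire", "Water", "Grass", "Water", "Fire")

# order the levels from most to least frequent
by_freq <- fct_infreq(types)
levels(by_freq)

# reverse that order, so the least frequent comes first
levels(fct_rev(by_freq))
```

Checking levels() like this before plotting is a quick way to confirm the bars will appear in the order you expect.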

More information on the forcats package can be found here: https://forcats.tidyverse.org/index.html

Finally, let’s save and label our example bar plot.

# save and label
count_type1 <- ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal) +
  labs(title = "Frequency of each Pokemon type",
       subtitle = "Coloured by if legendary or not",
       y = "Frequency of Pokemon type",
       x = "Type of Pokemon",
       fill = "Legendary pokemon?")

count_type1

ggsave("count_type1.PNG", count_type1)
## Saving 7 x 5 in image

Bar plots with counts exercise

Using the examples above, make a visualisation of the frequency of ski jump medal winners per country (team) from the Olympics dataset.

Try to include:

  • Setting a new theme using theme_set().
  • Order the x axis by the frequency in reverse order. hint: remember the forcats package
  • Make medals a factor, re-order them, and then colour them like we did in the last exercise.
  • Decide if position stack, dodge or fill work best with this visualisation.
  • Add a title and labels.
  • Save your visualisation.
# your code here

Bar plots with other statistics

A very useful function of bar plots is to show a group average instead of frequency. There are two approaches to showing a group average in a bar plot.

The first route is to aggregate your dataset, then add it into your bar plot as shown in the example below. We first use group_by() and summarise() from dplyr to find an average, in this case the average total statistics by pokemon generation.

We then put this data into ggplot. The difference from a normal bar plot is we provide a y axis (our calculated average), and add stat = "identity" to the geom_bar() function.

This is a great approach as it is easy to see what is happening at each step, making it simple to identify issues and make changes if needed.

# group and summarise to make average
avg_total_gen <- pokemon %>%
  group_by(generation) %>%
  summarise(avg_total = mean(total, na.rm = TRUE))

# print result
avg_total_gen
## # A tibble: 8 × 2
##   generation avg_total
##   <fct>          <dbl>
## 1 1               408.
## 2 2               406.
## 3 3               408.
## 4 4               450.
## 5 5               435.
## 6 6               439.
## 7 7               459.
## 8 8               446.
# add to bar plot with stat identity
ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity")
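You may also come across geom_col() in examples online. It is a ggplot2 shortcut that draws bars at the heights given by the y axis, so it behaves like geom_bar(stat = "identity"). The sketch below compares the two on a small made-up summary table (avg_toy and its values are assumptions for illustration).

```r
library(ggplot2)

# a made-up summary table with one average per group
avg_toy <- data.frame(generation = factor(c(1, 2, 3)),
                      avg_total = c(408, 406, 450))

# bar plot using stat = "identity"
p_bar <- ggplot(avg_toy, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity")

# the same plot using geom_col()
p_col <- ggplot(avg_toy, aes(x = generation, y = avg_total)) +
  geom_col()

p_bar
p_col
```

Which one to use is a matter of taste; this workshop sticks with geom_bar(stat = "identity").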

The other approach is to use the stat_summary() function to perform the same plot. The difference from a normal bar plot is we again provide the y axis but provide the variable we want to aggregate, total in this case. We then call stat_summary() and add two arguments, the function we want to use and what type of geometry to use; we’ve used mean and bar.

While this is less code, which is a good thing, it is hard to understand the steps taken to make the summary.

ggplot(pokemon, aes(x = generation, y = total)) +
  stat_summary(fun = "mean", geom = "bar")

We can also add error bars to our plots to help us understand how precise our average measure is. To add error bars it is generally easier to use the group_by and summarise approach. We will look at two types of error bars, the standard deviation and the standard error of the mean.

The standard deviation describes how spread out the individual values are around the average. The standard error of the mean estimates how far the sample mean is likely to be from the true mean, telling you how precise the sample mean is; it is calculated as the standard deviation divided by the square root of the number of observations.
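As a small worked example in base R, the sketch below calculates both measures for a made-up vector of values (x is an assumption for illustration). The standard error of the mean is the standard deviation divided by the square root of the number of observations.

```r
# made-up sample of eight values
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# number of observations
n <- length(x)

# standard deviation: spread of the values around the mean
sd_x <- sd(x)

# standard error of the mean: sd divided by the square root of n
sem_x <- sd_x / sqrt(n)

sd_x
sem_x
```

Because of the division by the square root of n, the standard error gets smaller as the sample gets larger, while the standard deviation does not.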

To calculate, we do the same aggregation as we did before but add sd (standard deviation) and n (the number of observations in each group) to the summarise function, and calculate the sem (standard error of the mean) in a mutate function by dividing sd by the square root of n.

# group and summarise to make average, sd and n per group
avg_total_gen <- pokemon %>%
  group_by(generation) %>%
  summarise(avg_total = mean(total, na.rm = TRUE),
            sd = sd(total, na.rm = TRUE),
            n = sum(!is.na(total))) %>%
  mutate(sem = sd / sqrt(n))

# print result
avg_total_gen

To add error bars we use the geom_errorbar() function, which requires two aesthetics within an aes() function, ymin and ymax. To find ymin we subtract the sd or sem from our avg_total (y axis value), and to find ymax we add it.

# adding standard deviation error bars
ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = avg_total-sd, ymax = avg_total+sd)) +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard deviation")

# adding standard error bars
ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = avg_total-sem, ymax = avg_total+sem)) +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard error of the mean")

You can edit the look of the error bars, such as making them narrower and changing the colour. See the example below on how to do this. We’ve also changed the colour of the bars.

ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity", fill = "orange") +
  geom_errorbar(aes(ymin = avg_total-sem, ymax = avg_total+sem), width = 0.3, colour = "darkblue") +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard error of the mean")
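For completeness, error bars can also be added with stat_summary(), using the mean_se() helper that comes with ggplot2 to calculate the mean plus or minus one standard error. This is a sketch of an alternative rather than the approach used in this workshop, and the toy data frame is made up for illustration.

```r
library(ggplot2)

# made-up data: two groups with three values each
toy <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                  value = c(1, 2, 3, 4, 6, 8))

# bars show the group mean, error bars show mean +/- standard error
p_se <- ggplot(toy, aes(x = group, y = value)) +
  stat_summary(fun = mean, geom = "bar", fill = "orange") +
  stat_summary(fun.data = mean_se, geom = "errorbar",
               width = 0.3, colour = "darkblue")

p_se
```

As with the earlier stat_summary() example, this is less code, but the summary table approach makes it easier to check the numbers behind the plot.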

If you want to add error bars to bar plots with different groupings on the x axis we need to make a few subtle changes. The main change is that we need to have a dodge bar chart.

First we will re run our avg_total_gen aggregation and add another column to our group_by. We then pre-define how wide the bars and error bars should be. Instead of using position = "dodge" we use our dodge variable we just made, and add the fill to be legendary (our second grouping).

# group by legendary as well
avg_total_gen <- pokemon %>%
  group_by(generation, legendary) %>%
  summarise(avg_total = mean(total, na.rm = TRUE),
            sd = sd(total, na.rm = TRUE),
            n = sum(!is.na(total))) %>%
  mutate(sem = sd / sqrt(n))
## `summarise()` has grouped output by 'generation'. You can override using the
## `.groups` argument.
# pre-define the dodge position
dodge <- position_dodge(width = 0.8)

ggplot(avg_total_gen, aes(x = generation, y = avg_total, fill = legendary)) +
  geom_bar(stat = "identity", position = dodge) +
  geom_errorbar(aes(ymin = avg_total-sem, ymax = avg_total+sem), position = dodge, width = 0.3) +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard error of the mean") +
  scale_fill_manual(values = manual_pal)

Bar plots with other statistics exercise

Using the examples above and the Olympics dataset, make a visualisation of the average age (mean or median) of GBR (Great Britain) medal winners by medal type and gender, making sure to

  • show the difference between male and female athletes using colours
  • show error bars for either standard deviation or standard error of the mean
  • colour, label and save your visualisation

hint: don’t forget to use dodge <- position_dodge(width = 0.8)

# your code here

Beyond bar plots

Bar plots are not the only option to view aggregated data, and there are some sources that suggest bar plots are less than ideal for any visualisation other than showing the frequency of a categorical variable. See https://paulvanderlaken.com/2018/12/17/avoid-bar-plots-for-continuous-data-do-this-instead/ for details on this.

Fortunately, there are alternatives, such as box plots which will be covered in the second data visualisation workshop, or we can use scatter plots! Scatter plots allow us to see all the data and we can add on an average, the best of both worlds.

In order to recreate what we just did with bar plots using scatter plots, we can either use both geom_point() and stat_summary(), or make a summary table and add that using a second geom_point() function. First, let’s just plot the data as a scatter plot, making the points larger and more transparent. Lowering the opacity (alpha) is important in these plots because overlapping points then build up darker colours, which indicate a higher density of data points.

ggplot(pokemon, aes(x = generation, y = total)) +
  geom_point(size = 5, alpha = .33)

Now we can add the stat_summary() function. We are going to use the mean, the geom is point, and the shape is the _ (underscore) symbol (number 95), which draws as a short horizontal line; we will also make the shape larger so we can see it more easily.

# using stat_summary
ggplot(pokemon, aes(x = generation, y = total)) +
  geom_point(size = 5, alpha = .33) +
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20)
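
The fun argument accepts other summary functions too, such as median. The sketch below uses a small made-up data frame (toy, with hypothetical group and value columns) and then uses layer_data() to inspect the values that the summary layer actually draws.

```r
library(ggplot2)

# made-up data: group "c" contains one unusually high value
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  value = c(1, 2, 3, 4, 2, 4, 6, 8, 1, 1, 2, 20)
)

p <- ggplot(toy, aes(x = group, y = value)) +
  geom_point(size = 5, alpha = .33) +
  stat_summary(fun = median, geom = "point",
               shape = 95, size = 20)

p

# y values computed by the summary layer (the second layer of the plot)
summary_y <- layer_data(p, 2)$y
summary_y
```

The median is less affected than the mean by the single high value in group "c", which is one reason you might prefer it for skewed data.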

If we use the summary table option we first make a summary table with group_by() and summarise(). Then we add two geom_point() functions. The first has the pokemon data and our x and y axis. The second is our summary table, with the same x axis and the avg_total as the y axis.

# summary table option
gen_avg_total <- pokemon %>%
  group_by(generation) %>%
  summarise(avg_total = mean(total, na.rm = TRUE))

gen_avg_total
## # A tibble: 8 × 2
##   generation avg_total
##   <fct>          <dbl>
## 1 1               408.
## 2 2               406.
## 3 3               408.
## 4 4               450.
## 5 5               435.
## 6 6               439.
## 7 7               459.
## 8 8               446.
ggplot() +
  geom_point(data = pokemon,
             aes(x = generation, y = total),
             size = 5, alpha = .33) +
  geom_point(data = gen_avg_total,
             aes(x = generation, y = avg_total),
             shape = 95, size = 20)

Either option works well, but for the rest of the examples we will use the stat_summary() option as it is less code.

Now we have all our data so we can see the number of points for each group, and we can see the average per group!

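Finally, we can add colour by our grouped variable (legendary) and change the colour palette. Just like with the bar plots we can adjust the positioning to dodge. By default points use the identity position, so the two groups are drawn on top of each other at each generation; dodge places them side by side. The examples below show both the default and dodge versions.

# default position (identity): groups overlap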
ggplot(pokemon, aes(x = generation, y = total, colour = legendary)) +
  geom_point(size = 5, alpha = 0.3) + 
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20) +
  scale_colour_brewer(palette = "Set1")

# position dodge
dodge <- position_dodge(width = 0.8)

ggplot(pokemon, aes(x = generation, y = total, colour = legendary)) +
  geom_point(size = 5, alpha = 0.3, position = dodge) + 
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20,
               position = dodge) +
  scale_colour_brewer(palette = "Set1")

Beyond bar plots exercise

Recreate your last visualisation, average age (mean or median) of GBR (Great Britain) medal winners by medal type and gender, using the geom_point() and stat_summary() method detailed above.

# your code here

Individual coding challenge

For the individual coding challenge we will be using the food consumption data from Tidy Tuesday: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-18/readme.md.

Use what we have covered in this workshop to make two visualisations of this dataset:

  • A scatter plot showing consumption and co2 emissions for a selected country (e.g. UK or France)
  • A bar plot of average co2 emissions per food category. Display just six countries to compare, such as UK, France, Germany etc. and colour them.

Use the tips covered in this workshop to add labels and colours and make your visualisations look appealing. Try and have some fun with it! =)

food_consumption <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv')
## Rows: 1430 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, food_category
## dbl (2): consumption, co2_emmission
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
food_consumption %>%
  glimpse()
## Rows: 1,430
## Columns: 4
## $ country       <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Arg…
## $ food_category <chr> "Pork", "Poultry", "Beef", "Lamb & Goat", "Fish", "Eggs"…
## $ consumption   <dbl> 10.51, 38.66, 55.48, 1.56, 4.36, 11.39, 195.08, 103.11, …
## $ co2_emmission <dbl> 37.20, 41.53, 1712.00, 54.63, 6.96, 10.46, 277.87, 19.66…
# your code here

Understanding which visualisation to use and when

Sometimes it can be hard to know where to start with a visualisation. A good starting point is understanding the options available for the data types you have. This website gives lots of information and visual guides on this process: https://www.data-to-viz.com/

Seeing what others have done with this data

The Olympics data we used for the exercises today is from the Tidy Tuesday GitHub repository. Tidy Tuesday is a social data visualisation challenge that happens every week and is a great way of learning about data viz.

Use the link below to see what others have done and posted using the Olympics data. Use it to get some ideas on what else you can try, or to get some inspiration from others. https://twitter.com/search?lang=en&q=%23tidytuesday%20olympics&src=typed_query

Fun extra

As a fun extra you can manually determine shapes in your visualisation using scale_shape_manual(). We’ve also removed the guide which was unnecessary by using guide = "none".

In the example below, as our x axis is generation from 1 to 8, we can make generation 1 have a shape of the number 1 and so on.

ggplot(pokemon, aes(x = generation, y = total, shape = generation)) +
  geom_point(size = 5, alpha = .33) +
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20) +
  scale_shape_manual(values = c(49:56),
                     guide = "none")
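
Shape numbers from 32 to 127 are drawn as the ASCII character with that code, which is why 49:56 gives the digits 1 to 8 and 95 gives an underscore. A short sketch using base R to convert between characters and their codes (the variable names are just for illustration):

```r
# character -> code
code_for_one <- utf8ToInt("1")
code_for_eight <- utf8ToInt("8")
c(code_for_one, code_for_eight)

# code -> character
summary_symbol <- intToUtf8(95)
digit_symbols <- intToUtf8(49:56, multiple = TRUE)
summary_symbol
digit_symbols
```

You can use utf8ToInt() in the same way to find the shape number for any other single character you want to plot.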


R Data Visualisation 2 - Objective of workshop

To create histograms, box, and time series plots using the ggplot2 package.

What this workshop will cover

In this workshop, the aim is to cover how to work with dates in plots, and use histograms and box plots. We will be covering:

  • How to make box plots with ggplot2
  • Displaying distributions with histograms
  • Working with dates with the lubridate package
  • How to make time series line plots
  • How to split your data into facet grids

In this data visualisation workshop we will be building on the concepts learnt in the first workshop, constructing visualisations using the ggplot2 library.

We will be using one new package called lubridate, a tidyverse package which is designed to make working with dates and times easier; this will help us in making time series visualisations. Run the code below to install lubridate.

# install lubridate
install.packages("lubridate")

Before we start we will need to load the libraries we will be using during this session. Run the code below to load your libraries.

# libraries we will be using
library(ggplot2)
library(dplyr)
library(lubridate)
library(readr)
library(janitor)
library(RColorBrewer)

Box plots

Box plots are designed to compare a continuous (numeric) variable across the levels of a categorical variable (samples or groups). They do this by displaying summary statistics of the continuous variable for each category.

The summary statistics shown are:

  • The median (middle value)
  • Interquartile range, known as IQR, which spans the values from Q1 to Q3 (25th to 75th percentile)
  • First quartile, known as Q1, which is the 25th percentile
  • Third quartile, known as Q3, which is the 75th percentile
  • “minimum” value, the smallest data point that is no lower than Q1 - 1.5*IQR
  • “maximum” value, the largest data point that is no higher than Q3 + 1.5*IQR
  • Outliers, which are values that fall outside of the “maximum” or “minimum” values
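
The summary statistics in the list above can be worked out by hand with base R. This is a minimal sketch on a made-up vector x, using quantile() with its default settings; the variable names are illustrative only.

```r
# made-up values with one unusually large observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# fences: values beyond these are treated as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# whiskers stop at the most extreme values inside the fences
inside <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)
outliers <- x[x < lower_fence | x > upper_fence]

c(median = median(x), q1 = q1, q3 = q3, iqr = iqr)
whiskers
outliers
```

Here the upper whisker ends at 9, the largest value inside the upper fence, and 30 is flagged as an outlier.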

We will use data from the Pokémon games again for our examples for box plots, which was web scraped from https://pokemondb.net/pokedex/all.

# load and clean names
pokemon <- read_csv("https://raw.githubusercontent.com/andrewmoles2/webScraping/main/R/data/pokemon.csv") %>%
  clean_names()
# review data
pokemon %>%
  glimpse()
## Rows: 952
## Columns: 13
## $ number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ name       <chr> "Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Charmele…
## $ type1      <chr> "Grass", "Grass", "Grass", "Fire", "Fire", "Fire", "Water",…
## $ type2      <chr> "Poison", "Poison", "Poison", NA, NA, "Flying", NA, NA, NA,…
## $ total      <dbl> 318, 405, 525, 309, 405, 534, 314, 405, 530, 195, 205, 395,…
## $ hp         <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45, 65,…
## $ attack     <dbl> 49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25, 90,…
## $ defense    <dbl> 49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 50, 40…
## $ sp_atk     <dbl> 65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, 25, 4…
## $ sp_def     <dbl> 65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, 25, 8…
## $ speed      <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 35, 75…
## $ legendary  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ generation <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

For these examples we will just look at one type of Pokémon, the electric type; the most famous of which is Pikachu! First, we extract just the electric type Pokémon, and make relevant columns factors.

# select columns to convert to factor
to_factor <- c("type1", "type2", "generation")

# extract just electric pokemon and make cols factors
electric_pokemon <- pokemon %>%
  filter(type1 == "Electric" | type2 == "Electric") %>%
  mutate(across(all_of(to_factor), factor))

head(electric_pokemon)
## # A tibble: 6 × 13
##   number name      type1    type2 total    hp attack defense sp_atk sp_def speed
##    <dbl> <chr>     <fct>    <fct> <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl> <dbl>
## 1     25 Pikachu   Electric <NA>    320    35     55      40     50     50    90
## 2     26 Raichu    Electric <NA>    485    60     90      55     90     80   110
## 3     81 Magnemite Electric Steel   325    25     35      70     95     55    45
## 4     82 Magneton  Electric Steel   465    50     60      95    120     70    70
## 5    100 Voltorb   Electric <NA>    330    40     30      50     55     55   100
## 6    101 Electrode Electric <NA>    490    60     50      70     80     80   150
## # … with 2 more variables: legendary <lgl>, generation <fct>

To make a box plot in ggplot we use the geom_boxplot() geom function. One of our axis variables has to be categorical and the other has to be numeric. In the below example we will use generation (categorical) and total (numeric).

# generation by total
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot()

From the output we see a few things. First is that each box has a line through the middle which indicates the median; the box itself is our interquartile range. The lines above and below the boxes (known as whiskers) extend to the “maximum” and “minimum” values. The black dots indicate outliers, which fall beyond the whiskers.

Just like with scatter and bar plots we can change the colours! You can use either fill or colour arguments with box plots, but fill tends to look better.

We will use the colour of Pikachu to colour our boxes. We used the Pokémon colour picker to get the colour of Pikachu: https://pokepalettes.com/#pikachu

ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652")

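Sometimes it is useful to hide the outliers. To do so, add the outlier.shape = NA argument; this only stops the outlier points being drawn, and the boxes and whiskers are calculated from the same data as before.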

ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA)

Displaying outliers is usually a good idea so we will keep them for now, and change their colour and shape. To adjust these we use the outlier.colour and outlier.shape arguments. We’ve used the colour of Pikachu’s cheeks as the outlier colour and made the shape square.

ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.colour = "#c52018",
               outlier.shape = 15)

Box plots exercise

For the exercises in this workshop we will be using daily COVID data that is collected from most of the countries around the world.

COVID data is from Our World in Data, and is stored in a GitHub repository. More information on the data and what each variable means can be found here: https://github.com/owid/covid-19-data/tree/master/public/data

# load in covid data and select cases, deaths and vaccines
covid <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv") %>%
  select(iso_code:new_deaths_smoothed_per_million, contains("vaccin"),
         population, median_age, gdp_per_capita)
## Rows: 181452 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): iso_code, continent, location, tests_units
## dbl  (62): total_cases, new_cases, new_cases_smoothed, total_deaths, new_dea...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# have a quick look at the data
covid %>% glimpse()
## Rows: 181,452
## Columns: 30
## $ iso_code                                   <chr> "AFG", "AFG", "AFG", "AFG",…
## $ continent                                  <chr> "Asia", "Asia", "Asia", "As…
## $ location                                   <chr> "Afghanistan", "Afghanistan…
## $ date                                       <date> 2020-02-24, 2020-02-25, 20…
## $ total_cases                                <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ new_cases                                  <dbl> 5, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ new_cases_smoothed                         <dbl> NA, NA, NA, NA, NA, 0.714, …
## $ total_deaths                               <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths                                 <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths_smoothed                        <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ total_cases_per_million                    <dbl> 0.126, 0.126, 0.126, 0.126,…
## $ new_cases_per_million                      <dbl> 0.126, 0.000, 0.000, 0.000,…
## $ new_cases_smoothed_per_million             <dbl> NA, NA, NA, NA, NA, 0.018, …
## $ total_deaths_per_million                   <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths_per_million                     <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_deaths_smoothed_per_million            <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ total_vaccinations                         <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_vaccinated                          <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_fully_vaccinated                    <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_vaccinations                           <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_vaccinations_smoothed                  <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ total_vaccinations_per_hundred             <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_vaccinated_per_hundred              <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ people_fully_vaccinated_per_hundred        <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_vaccinations_smoothed_per_million      <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_people_vaccinated_smoothed             <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ new_people_vaccinated_smoothed_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
## $ population                                 <dbl> 39835428, 39835428, 3983542…
## $ median_age                                 <dbl> 18.6, 18.6, 18.6, 18.6, 18.…
## $ gdp_per_capita                             <dbl> 1803.987, 1803.987, 1803.98…

For this exercise we will make two box plots from our data looking more at the demographics of each continent (we will look at cases and vaccines later).

Your two box plots should show the following:

  • The median age of each continent
  • The gdp per capita for each continent
  • Make sure to change the colour of the boxes and outliers to make it look better!
  • Try changing the shape and size of your outlier

Hint: you will have to remove the NA values from continent before plotting, e.g. covid %>% filter(!is.na(continent))

Hint: You can pipe from your filter function straight into ggplot2!

Hint: You can add colours in lots of ways but it can be fun to use a colour picker http://tristen.ca/hcl-picker/#/hlc/11/1.1/DC7261/D77357.

# your code here

Improving your box plots

The main issue with box plots, in a similar way to bar plots, is they can hide data. We can fix this by adding a scatter plot over the top of the boxes so we can see the full distribution of the data.

When adding in a scatter plot, we won’t need our outliers as the scatter plot will show these for us. We will need to remove them using the outlier.shape = NA argument.

ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_point()

Some of our data points are overlapping, which makes it a little hard to see all the data. We can fix this by changing the position of our points using the position = "jitter" argument. We can also use geom_jitter(), which is shorthand for geom_point(position = "jitter"); we will use geom_jitter() going forward as it is less typing.

# change position in geom_point
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_point(position = "jitter")

# using geom_jitter
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter()
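
By default geom_jitter() adds a small amount of random noise in both the x and y directions, so with a numeric y axis the points no longer sit at their exact values. The width and height arguments control this. The sketch below uses a small made-up data frame (toy, with hypothetical group and value columns) and sets height = 0 so that points are only spread sideways.

```r
library(ggplot2)

# made-up data with several identical values in each group
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(10, 10, 10, 12, 15, 20, 20, 21, 21, 21)
)

p <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) +
  # spread points sideways only; height = 0 leaves the y values unchanged
  geom_jitter(width = 0.2, height = 0)

p

# y positions drawn by the jitter layer (the second layer of the plot)
jittered_y <- layer_data(p, 2)$y
jittered_y
```

Keeping height at 0 is a reasonable choice when the exact y values matter, as they do when reading values off a box plot.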

We can also add a colour grouping to our points to make them more meaningful. We add the colour aesthetic to our geom_jitter() function. In the example we are colouring our points by whether a Pokémon is legendary or not.

ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary))

Finally we can change the colours of our points, which in this case we have done manually. Again, the colours were taken from the pokemon colour picker of pikachu: https://pokepalettes.com/#pikachu.

ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary)) +
  scale_colour_manual(values = c("#c52018", "#41414a"))

Now we can add a title and save the plot! When saving the plot we have manually adjusted the width of the plot. You can also change the height.

electric_pokemon_box <- ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary)) +
  scale_colour_manual(values = c("#c52018", "#41414a")) +
  labs(title = "Summary of electric pokemon for each generation") +
  theme_bw()

electric_pokemon_box

ggsave("electric_pokemon_box.png", electric_pokemon_box,
       width = 5.5)
## Saving 5.5 x 5 in image

Improving your box plots exercise

For this exercise we will look at vaccines! We will look at 10 countries to see the difference in vaccine distribution; 5 have low GDP and 5 have high GDP. The data will be pre-prepared for you. We have made a vector with the countries that have high and low GDP. Then we have filtered our covid data by this vector, and made location a factor with its levels in the same order as the vector.

  1. Make a box plot using the covid_select_countries data, with x = location and y = total_vaccinations_per_hundred. Be sure to include geom_jitter().
  2. Now improve the look of your box plot! Change the colour of the boxes and the points, make the points more transparent, remove the outliers, change the theme, and flip the co-ordinates.
  3. Make another box plot the same way but use the people_fully_vaccinated variable as your y axis.
  4. Give both your box plots a title and change the axis labels (if you want).
  5. Save your plots using ggsave(). You will need to assign the plots to a variable first.
# Make vector with low and high gdp countries
high_low_gdp <- c("Sierra Leone", "Ethiopia","Yemen", 
                  "Zambia", "Nepal", "Sweden", "Australia",
                  "Saudi Arabia", "Germany", "United Kingdom")

# Only include locations in high_low_gdp
# Make location a factor, ordered by high_low_gdp
covid_select_countries <- covid %>%
  filter(location %in% high_low_gdp) %>%
  mutate(location = factor(location, levels = high_low_gdp))

# your code here

Displaying distributions with histograms

Histograms are great for visualising the distribution of numeric data. Histograms have one numerical variable as their input.

To make a histogram with ggplot we provide a numerical value to our x axis, and use the geom_histogram() geom. In the example we are using all the pokemon data and showing the distribution of the total column.

ggplot(pokemon, aes(x = total)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can adjust the size of the bins of our plot with two methods: changing the binwidth or selecting the number of bins. In a histogram, a bin is the range of x axis values covered by each bar; the wider the bin, the more data it includes.

The first example uses binwidth. The number you provide is in the units of your x axis. In our example we are using the total column, which ranges from 175 to 754. If we have binwidth = 8, each bar covers a span of 8 units on the x axis, and its height is the number of Pokémon whose total falls in that span. Run the two examples below with a smaller and larger binwidth to see the results.

# summary stats for total column
summary(pokemon$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   175.0   325.0   450.0   429.5   505.0   754.0
# binwidth of 8
ggplot(pokemon, aes(x = total)) +
  geom_histogram(binwidth = 8) +
  labs(title = "Small binwidth (8)")

# binwidth of 50
ggplot(pokemon, aes(x = total)) +
  geom_histogram(binwidth = 50) +
  labs(title = "Larger binwidth (50)")
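
The binwidth and the number of bins are two views of the same choice: dividing the range of the data by the binwidth gives roughly the number of bars. A small sketch of that arithmetic, using the minimum and maximum of the total column from the summary above; approx_bins() is a hypothetical helper written for this illustration, not a ggplot2 function.

```r
# minimum and maximum of pokemon$total, taken from the summary above
total_min <- 175
total_max <- 754

# rough number of bars needed to cover the range for a given binwidth
approx_bins <- function(min_value, max_value, binwidth) {
  ceiling((max_value - min_value) / binwidth)
}

bins_small_width <- approx_bins(total_min, total_max, binwidth = 8)
bins_large_width <- approx_bins(total_min, total_max, binwidth = 50)

bins_small_width
bins_large_width
```

The count ggplot2 draws may differ slightly from this estimate because of where it places the bin edges, but the relationship holds: a smaller binwidth means more, narrower bars.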

The other method is to select the number of bins to use, using the bins argument. The more bins we use, the less data will be contained in each bin. In the example below we have bins with lots of data (bins = 10) and bins with less data (bins = 50). Which do you think is best?

# using 10 bins
ggplot(pokemon, aes(x = total)) +
  geom_histogram(bins = 10) +
  labs(title = "Less bins = more data in each bin")

# using 50 bins
ggplot(pokemon, aes(x = total)) +
  geom_histogram(bins = 50) +
  labs(title = "More bins = less data in each bin")

It can be helpful to colour your histogram by a categorical variable. This works the same as a box plot, using the fill argument. In the example we have filled our histogram by the legendary category.

ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20)

Another useful method is to use facets, which split up your data by a categorical variable and present it in a grid-like layout.

There are two techniques in ggplot to make facets, using facet_grid() or facet_wrap(). To use facet_grid() we define if we want to display our data row-wise (rows =) or column-wise (cols =). When defining which column to split our data by we need to use the vars() function. See the two examples below on how to do a row or column facet grid.

# row-wise display
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_grid(rows = vars(legendary)) +
  labs(title = "Row-wise facet grid")

# column-wise display
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_grid(cols = vars(legendary)) +
  labs(title = "column-wise facet grid")

The other option is facet_wrap(), which by default only needs the column you want to split your data by. It does allow extra specification with the nrow and ncol arguments, which define how many rows or columns to display.

In the examples below we show the default facet_wrap, and how to adjust the column or row specification. We have used the generation column as it has more groups.

# default facet_wrap
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation)) +
  labs(title = "Default facet wrap")

# 4 rows
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation),
             nrow = 4) +
  labs(title = "Facet wrap with 4 rows")

# 4 columns
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation),
             ncol = 4) +
  labs(title = "Facet wrap with 4 columns")
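
facet_grid() also accepts rows and cols together, which splits the data by two categorical variables at once. The sketch below uses the mpg dataset that ships with ggplot2 so that it runs on its own; the choice of the drv and year columns is purely for illustration.

```r
library(ggplot2)

# one panel for each combination of drive type (rows) and model year (columns)
p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_grid(rows = vars(drv), cols = vars(year)) +
  labs(title = "Facet grid with rows and columns")

p

# number of panels that contain data
n_panels <- length(unique(layer_data(p, 1)$PANEL))
n_panels
```

This layout works best when both variables have only a few groups; with many groups facet_wrap() is usually easier to read.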

Displaying distributions exercise

For this exercise we will be making a histogram using the people_fully_vaccinated_per_hundred column for each continent.

  • Make a histogram with people_fully_vaccinated_per_hundred as your x axis
  • Add a fill argument with continent
  • Adjust the binwidth or bins (e.g. binwidth = 5 looks good)
  • Using RColorBrewer, adjust the colours used in fill

Hint: you will have to remove the NA values from continent before plotting, e.g. covid %>% filter(!is.na(continent))

Hint: You can pipe from your filter function straight into ggplot2!

Hint: To change the fill colours you can use scale_fill_brewer(palette = "a palette")

Hint: Use brewer.pal.info to find RColorBrewer palettes

# your code here

Working with the date data type with lubridate

Working with the date data type when programming can be a bit tricky for many reasons. There are different formats, time zones, and the challenge of extracting information from the date. Fortunately, the lubridate package comes to the rescue!

There are three types of date data type: date (2010-09-01), time (15:08:52 BST), date-time (2010-09-01 15:08:52 BST). For this workshop we will be focusing on the date type as it is the most common.
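
As a quick illustration of the different formats problem, lubridate provides parsing functions named after the order of the date parts, such as ymd(), dmy() and mdy(). The sketch below parses the same calendar date written three ways; the strings are made up for this example.

```r
library(lubridate)

# the same calendar date written in three different formats
d1 <- ymd("2010-09-01")  # year, month, day
d2 <- dmy("01/09/2010")  # day, month, year
d3 <- mdy("09-01-2010")  # month, day, year

d1
class(d1)

# all three parse to the same date
d1 == d2 & d2 == d3
```

Choosing the function that matches the order of the parts in your data avoids ambiguous dates such as 01/09/2010 being read as 9 January.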

You can find out today’s date (more useful than it sounds) or the date and time using the today() or now() functions.

# make sure dplyr and lubridate are loaded
library(dplyr)
library(lubridate)

# get today's date
today()
## [1] "2022-04-25"
# today's date and time
now()
## [1] "2022-04-25 16:14:25 BST"
# make today's date a variable
today_date <- today()

A great feature of lubridate is extracting the year, month, day, or week day information from your date. We can test it out on today’s date. Run the code to see the output.

# year
year(today_date)
## [1] 2022
# month
month(today_date)
## [1] 4
month(today_date, label = TRUE)
## [1] Apr
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
# week
week(today_date)
## [1] 17
# day
day(today_date)
## [1] 25
# weekday
wday(today_date)
## [1] 2
wday(today_date, label = TRUE)
## [1] Mon
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

Notice that for the month and wday functions we have the option to add labels. This can be very useful, making your month or week day outputs more readable.
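
By default wday() counts Sunday as day 1, which is why a Monday returns 2. The week_start argument changes this. A minimal sketch, using the date shown in the output above (the variable names are illustrative only):

```r
library(lubridate)

monday <- as.Date("2022-04-25")  # a Monday

# default: the week starts on Sunday
sunday_start <- wday(monday)

# week_start = 1: the week starts on Monday
monday_start <- wday(monday, week_start = 1)

c(sunday_start = sunday_start, monday_start = monday_start)
```

Setting week_start also changes the order of the factor levels when label = TRUE, which controls the order of week days on a plot axis.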

For the rest of the examples we will use some randomised, made-up data containing daily sleep and step information. Run the code below to see the data.

note: to make this data we have used the randomisation function rnorm; sample and runif are related functions. If you are interested, look them up to see how they work. Because the data is random, your values will differ from the output shown here.

# make some random data
df <- data.frame(
  date = seq(as.Date("2019-01-01"), as.Date("2021-12-01"), by = "days"),
  hours_sleep = round(rnorm(1066, mean = 9, sd = 1.5)),
  steps = round(rnorm(1066, mean = 8000, sd = 2000))
)

head(df)
##         date hours_sleep steps
## 1 2019-01-01           9  7084
## 2 2019-01-02           8  7518
## 3 2019-01-03           8  7281
## 4 2019-01-04           8  9719
## 5 2019-01-05           9  6302
## 6 2019-01-06          10  9760

We can now use the mutate function to make a year, month, week, day, and week day column.

df <- df %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         week = week(date),
         day = day(date),
         week_day = wday(date, label = TRUE))

head(df)
##         date hours_sleep steps year month week day week_day
## 1 2019-01-01           9  7084 2019   Jan    1   1      Tue
## 2 2019-01-02           8  7518 2019   Jan    1   2      Wed
## 3 2019-01-03           8  7281 2019   Jan    1   3      Thu
## 4 2019-01-04           8  9719 2019   Jan    1   4      Fri
## 5 2019-01-05           9  6302 2019   Jan    1   5      Sat
## 6 2019-01-06          10  9760 2019   Jan    1   6      Sun
# see the breakdown of the date
df[1:2, c("date", "year", "month", "week", "day", "week_day")]
##         date year month week day week_day
## 1 2019-01-01 2019   Jan    1   1      Tue
## 2 2019-01-02 2019   Jan    1   2      Wed

Breaking the date down in this way allows us to do some aggregation of our data by the year, month, week, day, or weekday! In the examples below we have shown year and weekday.

# aggregate by year
df %>%
  group_by(year) %>%
  summarise(avg_sleep = mean(hours_sleep),
            avg_steps = mean(steps),
            total_steps = sum(steps))
## # A tibble: 3 × 4
##    year avg_sleep avg_steps total_steps
##   <dbl>     <dbl>     <dbl>       <dbl>
## 1  2019      9.19     8031.     2931381
## 2  2020      8.83     7897.     2890452
## 3  2021      8.92     8030.     2689951
# aggregate by week day
df %>%
  group_by(week_day) %>%
  summarise(avg_sleep = mean(hours_sleep),
            avg_steps = mean(steps),
            total_steps = sum(steps))
## # A tibble: 7 × 4
##   week_day avg_sleep avg_steps total_steps
##   <ord>        <dbl>     <dbl>       <dbl>
## 1 Sun           8.97     8004.     1216605
## 2 Mon           8.88     8045.     1222870
## 3 Tue           9.08     8063.     1233563
## 4 Wed           9.12     8122.     1242598
## 5 Thu           8.80     7824.     1189293
## 6 Fri           9.05     8004.     1216561
## 7 Sat           8.94     7831.     1190294

There are more functions from the lubridate package that we won’t be able to cover in this session, so do have a look at the package website for more information - https://lubridate.tidyverse.org/index.html - and check out the R for Data Science chapter on dates - https://r4ds.had.co.nz/dates-and-times.html.

lubridate exercise

Using the examples above, extract year, month, day, day of week from covid data, and do an aggregation!

  1. Add new columns to your covid data for year, month, week, day and week_day. Try to add labels to month and week_day.
  2. Aggregate your covid data by year and month to find the mean total cases per million and mean total deaths per million.
  3. Print out the result.
# your code here

# separate date column
covid <- covid %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         week = week(date),
         day = day(date),
         week_day = wday(date, label = TRUE))

# make year and month aggregate
avg_year_month_covid <- covid %>%
  group_by(year, month) %>%
  summarise(
    avg_total_cases_per_mil = mean(total_cases_per_million, na.rm = TRUE),
    avg_total_deaths_per_mil = mean(total_deaths_per_million, na.rm = TRUE)
    )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
avg_year_month_covid
## # A tibble: 28 × 4
## # Groups:   year [3]
##     year month avg_total_cases_per_mil avg_total_deaths_per_mil
##    <dbl> <ord>                   <dbl>                    <dbl>
##  1  2020 Jan                     0.642                   0.0318
##  2  2020 Feb                     3.27                    0.325 
##  3  2020 Mar                   128.                      8.39  
##  4  2020 Apr                   606.                     34.0   
##  5  2020 May                  1078.                     58.9   
##  6  2020 Jun                  1654.                     74.3   
##  7  2020 Jul                  2414.                     92.9   
##  8  2020 Aug                  3434.                    114.    
##  9  2020 Sep                  4641.                    136.    
## 10  2020 Oct                  6508.                    161.    
## # … with 18 more rows

Time series plots

Time series plots visualise data over a period of time, which can be hourly, daily, weekly, monthly, or yearly. They are a great way to view trends over time. When plotting a time series, the x axis is the date and the y axis is your measure.

The simplest form of time series visualisation in R uses an unedited date variable. Using our example data (df) we will visualise how steps have changed each day.

# daily time series
df %>%
  ggplot(aes(x = date, y = steps)) +
  geom_line()

As we can see, the number of steps taken each day is pretty variable, as you might expect. There is a lot of data here so it is hard to see any real patterns; it just looks like noise! To solve this we can aggregate our data by year, month, or week to see if we can get any more insights.

For our example data it might be interesting to see how many steps are taken on average each month, and to compare this year on year.

We first aggregate our data, grouping by the month and year columns we made with the lubridate package, find the average steps, and convert the year column into a factor to make plotting easier; month is already a factor.

# aggregated time series by month
monthly_steps <- df %>%
  group_by(month, year) %>%
  summarise(avg_steps = mean(steps)) %>%
  mutate(year = factor(year))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
monthly_steps
## # A tibble: 36 × 3
## # Groups:   month [12]
##    month year  avg_steps
##    <ord> <fct>     <dbl>
##  1 Jan   2019      7948.
##  2 Jan   2020      7574.
##  3 Jan   2021      7758.
##  4 Feb   2019      8619.
##  5 Feb   2020      7767.
##  6 Feb   2021      8146.
##  7 Mar   2019      8565.
##  8 Mar   2020      8108.
##  9 Mar   2021      8000.
## 10 Apr   2019      7361.
## # … with 26 more rows
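
The message about grouped output printed above can be avoided by telling `summarise()` what to do with the groups. Below is a minimal sketch using a small made-up dataset (`steps_toy` is a hypothetical stand-in for df, not the workshop data); `.groups = "drop"` returns an ungrouped result.

```r
library(dplyr)

# hypothetical step data: two years, two months, two readings each
steps_toy <- tibble(
  year  = c(2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020),
  month = c("Jan", "Jan", "Feb", "Feb", "Jan", "Jan", "Feb", "Feb"),
  steps = c(7000, 8000, 9000, 8000, 6000, 7000, 8000, 9000)
)

# .groups = "drop" removes the grouping after summarising,
# so no message about grouped output is printed
monthly_toy <- steps_toy %>%
  group_by(month, year) %>%
  summarise(avg_steps = mean(steps), .groups = "drop")

monthly_toy

# check whether the result is still grouped
is_grouped_df(monthly_toy)
```

Dropping the groups is a sensible default when you are going straight into plotting, as any later `mutate()` or `summarise()` then works on the whole table.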

Now we can make a time series by month! It is often helpful when using geom_line() to also pair it with geom_point() so we can see each data point clearly as well as the trends shown by the lines.

ggplot(monthly_steps,
       aes(x = month, y = avg_steps)) +
  geom_line() +
  geom_point()

That didn’t work as expected! As month is a factor, ggplot treats each month as its own group by default, so it does not join the months together. Because our data has one value per year and month, we need to use the group = argument to tell ggplot which points to connect up.

By adding group = year our plot will now look like a time series; run the code to check it out.

ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year)) +
  geom_line() +
  geom_point()

It would also be helpful to see what year each line represents. We add the colour = year argument in as well to show this.

ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  geom_line() +
  geom_point()

Our plot is still looking a little busy so we can use facets to split our data by year. We’ve used facet_wrap here with 3 rows.

ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(year), nrow = 3)

Finally, we can make a few final adjustments and we have a nice visualisation that shows average step count per month for the years 2019 to 2021. Below is a list of all the additions made to change the look of the plot:

  • Changed the size of the lines and the points with the size = argument
  • Added a title and changed the axis names
  • Added a colour scale from the RColorBrewer package
  • Changed the theme to dark and changed the font to Avenir
  • Adjusted the y axis limits
step_count <- ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  geom_line(size = 2.5) +
  geom_point(size = 3) +
  facet_wrap(vars(year), nrow = 3) +
  labs(title = "Average step count per month for the years 2019 to 2021",
       x = "Month", y = "Average steps (mean)",
       colour = "Year") +
  scale_colour_brewer(palette = "Pastel2") +
  theme_dark(base_family = "Avenir") +
  scale_y_continuous(limits = c(7000, 9000)) 

step_count
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

ggsave("step_count.png", step_count, width = 9)
## Saving 9 x 5 in image
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Removed 1 rows containing missing values (geom_point).
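
The warnings above say that one row was removed. A likely cause, assuming one monthly average falls outside 7000 to 9000, is that limits set with scale_y_continuous() turn out-of-range values into missing values before plotting. The sketch below uses made-up data (`toy` is hypothetical) to compare this with coord_cartesian(), which zooms the view without removing any data.

```r
library(ggplot2)

# hypothetical monthly averages; the March value sits below 7000
toy <- data.frame(
  month = factor(c("Jan", "Feb", "Mar", "Apr"),
                 levels = c("Jan", "Feb", "Mar", "Apr")),
  avg_steps = c(7500, 8200, 6800, 7900)
)

# limits on the scale: the out-of-range value becomes NA and is dropped,
# so printing this plot gives "Removed ... rows" warnings
p_scale <- ggplot(toy, aes(x = month, y = avg_steps, group = 1)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(limits = c(7000, 9000))

# limits on the coordinate system: the view is zoomed, no data are removed
p_zoom <- ggplot(toy, aes(x = month, y = avg_steps, group = 1)) +
  geom_line() +
  geom_point() +
  coord_cartesian(ylim = c(7000, 9000))
```

Print p_scale and p_zoom to compare them. In the first plot the line breaks around March; in the second the line carries on to the edge of the panel.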

Time series plots exercise

For this exercise we will be looking at the vaccine roll out for the United Kingdom, India, Nepal, Israel, Germany, and Australia. Each country has had a slightly different roll out, with Israel being the fastest. We will be looking at the week by week roll out for 2021.

Data preparation:

  1. Make a vector called sel_country that includes United Kingdom, India, Nepal, Israel, Germany, and Australia
  2. Filter your covid data to include only locations that are in your sel_country vector, and filter for the year to be equal to 2021. Assign your filtered data to a variable called weekly_vax.
  3. Aggregate your weekly_vax data by week and location to find the mean of the people_vaccinated_per_hundred column. Assign the result back to weekly_vax
  4. Make the week and location columns of weekly_vax factors

Plotting:

Using your weekly_vax data you have just prepared:

  1. Make a time series plot with week as your x axis and your aggregation of the people_vaccinated_per_hundred column as your y axis.
  2. Colour and group your data by location.
  3. Make any aesthetic changes you think will make the plot better based on what we have covered so far, such as adding titles, changing colours, or adding facets (facet_grid() or facet_wrap()).
  4. Assign your final plot to a variable and save it!

Hint: if your x axis is looking squashed or cramped, try adding in scale_x_discrete(guide = guide_axis(n.dodge = 2))

# your code here

Final task - Please give us your individual feedback!

We would be grateful if you could take a minute before the end of the workshop so we can get your feedback!

https://lse.eu.qualtrics.com/jfe/form/SV_eflc2yj4pcryc62?coursename=R%20Data%20Visualisation%202:%20Box,%20histogram,%20and%20line%20plots&topic=R&link=https://lsecloud.sharepoint.com/:u:/s/TEAM_APD-DSL-Digital-Skills-Trainers/ERDaMePD5XBKgxuOMtN94YoB4aDZ5dxXqgPXBDdzWFxYSQ&prog=DS&version=21-22

The solutions will be available from a link at the end of the survey.

Individual coding challenge

For the coding challenge we will look at other things you can do with ggplot2 such as making artwork! This is known as generative art, which is produced either in part or completely by automated processes.

Generative art is a complex topic, but some of the ideas and styles can be done using the aRtsy package, https://koenderks.github.io/aRtsy/, which makes generative art more accessible.

First, you will need to install the aRtsy package.

# install aRtsy
install.packages("aRtsy")

Then you will need to load it!

# load aRtsy
library(aRtsy)

When making generative art it is a good idea to make it reproducible, as there is a lot of randomisation involved. To make randomised results reproducible in R you set a seed: running the same code with the same seed gives the same results. We use the set.seed() function and add in any number. The number is our seed. If we gave someone else our code and our seed they would be able to reproduce our results.
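
To see what a seed does before using it for artwork, here is a small sketch with base R's sample() function (the variable names are just for illustration). Setting the same seed before each draw gives the same numbers.

```r
# same seed, same "random" numbers
set.seed(1)
first_draw <- sample(1:100, 5)

set.seed(1)
second_draw <- sample(1:100, 5)

# TRUE: both draws used the same seed
identical(first_draw, second_draw)

# a different seed will usually give a different draw
set.seed(2)
third_draw <- sample(1:100, 5)
third_draw
```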

We’ve given some examples below on making a striped artwork and flow fields. Run the code chunk below, then try changing the seed to see how the results change when you run it again!

Note: these will take a few moments to run!

# set the seed to 1
set.seed(1)

# make a colour palette from rcolorbrewer
set1 <- brewer.pal(n = 9, name = "Set1")
pastel1 <- brewer.pal(n = 9, name = "Pastel1")
paired <- brewer.pal(n = 12, name = "Paired")

# test out different parameters for stripes
canvas_stripes(paired, n = 800, H = 5, burnin = 5)

canvas_stripes(pastel1, n = 500, H = 15, burnin = 2)

# Test out different parameters for flow fields
canvas_flow(set1, background = "#fafafa", lines = 800, lwd = 0.30,
            iterations = 80, stepmax = 0.15)

pastel_flow <- canvas_flow(pastel1, background = "black", lines = 2000, lwd = 0.15,
            iterations = 30, stepmax = 0.10)

pastel_flow

# save pastel_flow
saveCanvas(pastel_flow, "pastel_flow.png")

Have a go yourself at making some generative art in R! Try out the following functions from aRtsy, changing the parameters to adjust the visualisation.

Don’t forget to save any of your artwork you like using the saveCanvas() function.

set.seed(1)

# your code here

Recommended resources to continue your data visualisation learning

The ggplot2 book is an excellent resource with lots of examples and exercises to have a go at https://ggplot2-book.org/.

Cedric Scherer writes blogs and tutorials on ggplot2 on his website. Some of his content is really great and worth looking through. Below are two of his tutorials to get you started:

Georgios Karamanis is a data visualisation designer and makes some amazing visualisations using R! It’s worth browsing his website for inspiration https://karaman.is/ or following him on twitter https://twitter.com/geokaramanis.

For ideas about what to do with your data have a look at the R graph gallery https://www.r-graph-gallery.com/.

---
title: "R Data Visualisation 1 & 2: Fast-track"
author:
   - name: Andrew Moles
     affiliation: Learning Developer, Digital Skills Lab
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_document: 
    theme: readable
    highlight: pygments
    keep_md: no
    code_download: true
    toc: true
    toc_float: 
      collapsed: true
---

# R Data Visualisation 1 - Objective of workshop

To create scatter and bar plot visualisations using the ggplot2 package. 

# What this workshop will cover

In this workshop, the aim is to cover how to use the ggplot2 package. We will be covering:

-   An introduction to the ggplot2 package
-   How to make scatter plots with ggplot2
-   How to make bar plots with ggplot2
-   How to change colours and other features in your visualisations  

------------------------------------------------------------------------

# Introduction

Data visualisation is a way of looking at your data using graphics, which provides a different perspective to your data.

There are a lot of different options for data visualisation with R. You can use the visualisation tools that come with R, ggplot and all its extensions, or for interactive visualisations there is the plotly library.

![](https://github.com/andrewmoles2/rTrainIntroduction/blob/main/r-data-visualisation-1/images/ggplot2_exploratory.png?raw=true){width="437"}

In this data visualisation series we will be mainly focussing on ggplot, as well as plotly. While the visualisation tools that come with R are useful, ggplot and plotly are generally easier to use and make great visualisations. For this tutorial we will be using the following packages: ggplot2, dplyr, readr, janitor, RColorBrewer, and forcats. Run the code below to install the packages if you don't have them installed already.
```{r eval=FALSE}
# install packages
install <- c("ggplot2", "dplyr", "readr",
             "janitor", "RColorBrewer",
             "forcats")

install.packages(install, Ncpus = 6)
```

Then we need to load them into our session. Run the code chunk below to load all the libraries you will need.
```{r message=FALSE}
# load packages
library(ggplot2)
library(dplyr)
library(readr)
library(janitor)
library(RColorBrewer)
library(forcats)
```

# What is ggplot and how does it work?

ggplot2 is a package for producing graphics that works by combining independent components when making graphs, known as layers. This makes ggplot2 both versatile and powerful; you are not limited by a set of options but instead can make novel graphics to suit your needs.

It is also important to note that ggplot can only use data frames. If your data is in another format you will need to transform it into a data frame in order to use ggplot.
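
As a minimal sketch of that conversion, the example below turns a small made-up matrix (`m` is hypothetical, not part of the workshop data) into a data frame with `as.data.frame()` before handing it to ggplot.

```r
library(ggplot2)

# hypothetical matrix of measurements with named columns
m <- matrix(c(1, 2, 3, 4, 2, 4, 5, 8), ncol = 2,
            dimnames = list(NULL, c("height", "weight")))

# convert to a data frame before passing it to ggplot()
m_df <- as.data.frame(m)

# the column names of the data frame are what we refer to in aes()
p <- ggplot(m_df, aes(x = height, y = weight)) +
  geom_point()
```

Print `p` to see the plot. Data read in with readr, as we do in this workshop, is already a data frame, so no conversion is needed there.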

In order to understand how the layers work we will first load in some data for our examples. We will use data from the Pokémon games, which was web scraped from <https://pokemondb.net/pokedex/all>.
```{r message=FALSE}
# load and clean names
pokemon <- read_csv("https://raw.githubusercontent.com/andrewmoles2/webScraping/main/R/data/pokemon.csv") %>%
  clean_names()
# review data
pokemon %>%
  glimpse()
```

The syntax for ggplot has three key components: the ggplot function call (`ggplot()`), the aesthetics (set with `aes()`), and the geometry (called geoms), which refers to the type of plot, for example scatter, bar, or line. The next three code chunks break this down.

```{r}
# call ggplot2 and add data
ggplot(pokemon)
```

Notice we just get a grey box. We have just loaded our data into ggplot but not much else! Now let's add the aesthetics and see what happens.

To add aesthetics we use the `aes()` function within the `ggplot()` function, and specify what our x and y axis will be with column names from our data, sp_atk and sp_def in this case.

```{r}
# add aesthetics
ggplot(pokemon, aes(x = sp_atk, y = sp_def))
```

It is starting to look more like a visualisation now: we can see the x and y axis labels, but we still have no data points showing. We have to add a geometry for that to happen. Notice the syntax here: we use the `+` operator to add a geometry to ggplot, in this case `geom_point()`, which makes scatter plots. All geometry functions start with `geom_` and end with the type of geometry such as point, bar, or line.

```{r}
# pick which geometry
ggplot(pokemon, aes(x = sp_atk, y = sp_def)) +
  geom_point()
```

This is the fundamental concept of ggplot: you construct your visualisations in layers, adding geometry layers and other features as you go.

## What is ggplot exercise

Using the pokemon data, make a scatter plot with *hp* on the x axis and *speed* on the y axis.

```{r}
# your code here

```

# Scatter plots

Scatter plots are for displaying the relationship between two numeric (or quantitative) variables. For each data point, the value of the first variable is represented on the X axis and the value of the second on the Y axis.

To make a scatter plot with ggplot2 we use the `geom_point()` function like you just saw. For a scatter plot, the variables on the X and Y axes should both be numeric.

The plot we just made in the example is okay but it could do with some improving. There are quite a few different ways to change the appearance of a visualisation, so let's go through them.

The first thing we will look at is adding some colour! There are a few options for adding colours to your plots. You can add the name, such as red, or you can use a hex code, or you can use a pre-defined palette. To add colour to a scatter plot we use the `colour =` argument.

```{r}
# colour of points
ggplot(pokemon, aes(x = hp, y = speed)) +
  geom_point(colour = "orange")
```

To colour your points by a group (or factor) we have to add the colour argument into the `aes()` function. This allows us to have different colours for different groups, which makes the plot more informative.

In the example below, our data is coloured by whether a pokemon is classified as legendary or not.

```{r}
# colour by factor
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point()
```

We get the default ggplot colours, which are okay. There are a few different ways of changing the colours; all of them use a `scale_` function in a slightly different way. In the two examples below we have changed the colours using the RColorBrewer package and have set the colours manually.

RColorBrewer comes with a set of palettes for different situations; you can view them by following this link <https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html>. To use these palettes with ggplot we use the `scale_colour_brewer()` function with an argument for which palette we want to use; in this example we are using Set1.

```{r}
library(RColorBrewer)
# adjusting colour by factor using RColorBrewer
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")
```

To make a manual palette, you first make a vector with your colours. To do so it is useful to use a colour picker such as <http://tristen.ca/hcl-picker/#/hlc/6/1/15534C/E2E062> or <https://coolors.co/>. You copy the hex code (a \# followed by 6 numbers or letters) and paste it into your vector, as you can see in the manual_pal vector below. To add the colours we use the `scale_colour_manual()` function, and set the values to our manual palette.

```{r}
# adjusting colour by factor using manual palette
manual_pal <- c("#90C0F8", "#EA964E")

ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point() +
  scale_colour_manual(values = manual_pal)
```

It is sometimes helpful to view the palette before using it. We can use the scales package for this, which is installed when you install ggplot2. We provide the `show_col()` function with the palette we want to view and it returns a grid view of the colours. In the example we look at Set1 from RColorBrewer and the manual palette we just used.

```{r}
# load scales
library(scales)
# view palettes
show_col(RColorBrewer::brewer.pal(n = 8, name = "Set1"))
show_col(manual_pal)
```

As well as changing the colour of the points, you can change their shape, size, and transparency (alpha). Just like with colour, we can define the size, shape or transparency either in the `aes()` function or in a `geom_` function. By adding them to the `geom_` function we manually change them. If we use them in `aes()` we have to associate the size/shape/alpha with a variable.

See the examples below. In the first we manually set the size and alpha. In the second we set the size to be defined by the total column in our pokemon data, and manually set the alpha.

```{r}
# manually set size and alpha
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary)) +
  geom_point(size = 5, alpha = 0.6) +
  scale_colour_brewer(palette = "Set1")

# manually set alpha, size by total
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6) +
  scale_colour_brewer(palette = "Set1")
```

To manually change the shape and replace the circles, we give the shape argument a number. Each number represents a shape, letter, or number; by default ggplot uses shape number 19. We can change the shape to a square for example by using the number 15.

```{r}
# default shape number
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6, shape = 19) +
  scale_colour_brewer(palette = "Set1")

# shape number for squares
ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6, shape = 15) +
  scale_colour_brewer(palette = "Set1")
```

View the image with the visual markdown editor to see what number represents what shape, letter, or number.

![](https://github.com/andrewmoles2/rTrainIntroduction/blob/main/r-data-visualisation-1/images/shapes.png?raw=true){width="800" height="900"}
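
If the image does not display, a similar chart can be drawn with ggplot itself. The sketch below is one way to do it, assuming the standard R point shapes numbered 0 to 25; the data frame and grid layout are made up for illustration. `scale_shape_identity()` tells ggplot to use the numbers in the shape column directly as shape numbers.

```r
library(ggplot2)

# one row per shape number, laid out on a 6 column grid
shape_id <- 0:25
shapes <- data.frame(
  shape = shape_id,
  x = shape_id %% 6,
  y = -(shape_id %/% 6)
)

shape_chart <- ggplot(shapes, aes(x = x, y = y)) +
  # fill only affects shapes 21 to 25, which have a separate fill colour
  geom_point(aes(shape = shape), size = 5, fill = "orange") +
  # label each point with its shape number
  geom_text(aes(label = shape), nudge_y = 0.35) +
  scale_shape_identity() +
  theme_void()
```

Print `shape_chart` to view it. Shapes 21 to 25 are the only ones that take both a colour (outline) and a fill.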

Finally we can add a title and save our plot! We've done two things in order to achieve this. To add a title, and change axis labels, we have used the `labs()` function. We add arguments for what we want to change, such as `title = "Pokemon Hit Points vs Speed"`. To change the legend labels we use colour and size, as we used these to define our legend in the `aes()` function.

To save the plot we assign our code to a variable, then we use the `ggsave()` function, which requires what you want to call the file and the file extension (e.g. plot.PNG or plot.JPG), then the ggplot object we created. Run the example below, and you should get a hp_vs_speed.PNG file where your Rmd file is saved. You can also adjust the size of the image saved using the width and height arguments.

```{r}
# save plot to a variable
hp_vs_speed <- ggplot(pokemon, aes(x = hp, y = speed, colour = legendary, size = total)) +
  geom_point(alpha = 0.6, shape = 15) +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Pokemon Hit Points vs Speed",
       subtitle = "Taken from pokemondb.net",
       x = "Hit Points",
       y = "Speed",
       colour = "Legendary pokemon?",
       size = "Total stats")

hp_vs_speed

# save plot
ggsave("hp_vs_speed.PNG", hp_vs_speed)

# save with defined width and height
ggsave("hp_vs_speed.PNG", hp_vs_speed,
       width = 7, height = 4.5)
```

## Scatter plots exercise

For the exercises in this workshop we will use data from the Olympics that includes all Olympic games from 1896 through to 2016. More information on the dataset can be found here <https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md>. Run the code provided to load the libraries and data into R.

We will make two scatter plots from the Olympics data. For both plots we will use dplyr to filter the information we are interested in, which has been done for you in this exercise. 

1)  Using the provided `scatter_plot1` data, make a scatter plot of Olympic gymnasts' heights (x axis) and weights (y axis).

-   Change the colour and shape arguments to tell us what sex the gymnasts are.
-   Change the colour palette by making a manual one or using RColorBrewer.
-   Be sure to give your plot a title, and save your plot.

2)  Using the provided `scatter_plot2` data, make a scatter plot of the age (y axis) of gymnastic medal winners by year (x axis). 

-   Colour your plot by medal by making a manual colour palette. *hint: the hex codes for gold, silver and bronze are: "#FFD700", "#C0C0C0", "#CD7F32"*
-   Use shape to tell us what sex the gymnasts were.
-   Be sure to give your plot a title, and save your plot.


```{r}
# make sure libraries are loaded
library(readr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)

# load in data
olympics <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv")

olympics %>% glimpse()

# data cleaning for first scatter plot
scatter_plot1 <- olympics %>%
  filter(sport == "Gymnastics")

# data cleaning for second scatter plot
scatter_plot2 <- olympics %>%
  filter(sport == "Gymnastics") %>%
  filter(!is.na(medal)) %>%
  mutate(medal = factor(medal, levels = c("Gold", "Silver", "Bronze")))

# your code here

```

# Quirks of ggplot2

There are a few quirks to be aware of when using ggplot2, and you'll see a few of them when you look for examples online. To help with this, let's have a look at a few of them!

The first quirk is piping data into ggplot, where you do not need to add your data into the `ggplot()` function as it is piped in. The main advantage of this approach is you can string together some data cleaning and then pipe the results straight into ggplot.

```{r}
# piping data into ggplot
pokemon %>%
  ggplot(aes(x = sp_atk, y = sp_def)) +
  geom_point()

# piping with filter
pokemon %>%
  filter(type1 == "Fire") %>%
  ggplot(aes(x = sp_atk, y = sp_def)) +
  geom_point()
```

The second quirk is adding aesthetics into a `geom_` function rather than the `ggplot()` function.

```{r}
# adding aesthetics into the geom_ call
ggplot(pokemon) +
  geom_point(aes(x = sp_atk, y = sp_def))
```

The third quirk is you can also add the data into the `geom_` function. When doing so you have to write `data =` explicitly, as the first argument of a `geom_` function is the mapping rather than the data; otherwise you will get an error.

```{r}
# adding data and aesthetics into the geom_ call
ggplot() +
  geom_point(data = pokemon, aes(x = sp_atk, y = sp_def))
```

The fourth quirk relates to the second and third, in that you can add aesthetics into a `geom_` function more than once. You might occasionally come across this for more complex visualisations.

In the example we will add the average of our x and y variables. First we make a summary table that has the averages of both axes, using `summarise()` from dplyr. Then we add two `geom_point()` functions, one with the pokemon data, and one with our summary table data.

```{r}
# example of adding aesthetics into more than one geom_ call
# calculate mean of sp_atk and sp_def
avg_sp <- pokemon %>%
  summarise(
    avg_sp_atk = mean(sp_atk, na.rm = TRUE),
    avg_sp_def = mean(sp_def, na.rm = TRUE))

avg_sp

# add average sp_atk and sp_def as black point
ggplot() +
  geom_point(data = pokemon, 
             aes(x = sp_atk, y = sp_def), 
             colour = "orange",
             size = 2.5) +
  geom_point(data = avg_sp, 
             aes(x = avg_sp_atk, y = avg_sp_def),
             size = 2.5)
```

The last quirk we will look at is adding to a ggplot visualisation after you have assigned it a name. This is very common in tutorials and on Stack Overflow. A good use of this is to build a base of the x and y you want to use and test out different geometries.

```{r}
# saving plot then adding to it
p <- ggplot(pokemon, aes(x = sp_atk, y = sp_def))

p

p + geom_point()

p + geom_line()
```

## Quirks of ggplot2 exercise

Make a visualisation of USA athletes' ages vs heights, showing the difference between the genders using colour. When making your visualisation try to:

-   Pipe the olympics data to a filter function and select all USA athletes
-   Pipe to a ggplot function
-   Add a geom_point function and add the aesthetics there rather than in `ggplot()`

```{r}
# your code here

```

# Bar plots with counts

Bar plots are used to show relationships between a numerical and categorical variable. The categorical variable is usually on the x axis, and the y axis is usually a frequency count.

By default, bar plots with ggplot only require an x or y axis. From that they make a frequency count of that variable. See the example below. First we use ggplot to make a bar plot to count the number of pokemon added in each generation. Then we do the same thing with dplyr to make an aggregate table; ggplot is in effect calculating this aggregate table and turning it into a plot for us!

It is important to make sure your x axis in a bar plot is a factor, as this helps ggplot to order the axis as you expect.

```{r}
# make generation a factor
pokemon$generation <- factor(pokemon$generation)

# default bar plot
ggplot(pokemon, aes(x = generation)) +
  geom_bar()

# dplyr aggregate equivalent
pokemon %>%
  count(generation)
```

To add colour to your bar plot we use the fill argument rather than colour. This can be confusing, so if you forget which one to use, just try both till the colours look right! To add our fill manually we add the fill argument to our `geom_bar()` function.

```{r}
# manually add fill colour
ggplot(pokemon, aes(generation)) +
  geom_bar(fill = "purple")
```

Just like with the scatter plot, we can colour our plot by a variable by putting the fill argument within the `aes()` function. The below example also shows the equivalent when doing aggregation using dplyr.

```{r}
# bar plot with colour by variable
ggplot(pokemon, aes(x = generation, fill = legendary)) +
  geom_bar()

# dplyr aggregate equivalent
pokemon %>%
  count(generation, legendary)
```

Notice in the above example that the bars by default were stacked on top of each other. We have two other options for changing this: a dodge setting (bars sit next to each other) or a fill setting (stacked and standardised so each bar has the same height). To change this we use the position argument within `geom_bar()`.

```{r}
# dodge bars
ggplot(pokemon, aes(x = generation, fill = legendary)) +
  geom_bar(position = "dodge")
# filled bars
ggplot(pokemon, aes(x = generation, fill = legendary)) +
  geom_bar(position = "fill")
```
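
To see what the fill position is showing, it can help to calculate the proportions yourself. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in for the pokemon data) and works out the share of each legendary value within a generation, so the shares in each generation add up to 1.

```r
library(dplyr)

# hypothetical stand-in for the pokemon data
toy <- tibble(
  generation = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 2)),
  legendary  = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# count each combination, then divide by the total for that generation
toy_props <- toy %>%
  count(generation, legendary) %>%
  group_by(generation) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

toy_props
```

The prop column corresponds to the height of each coloured segment in a bar plot made with `position = "fill"`.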

A useful thing to change with bar plots is to *flip* your coordinates. This is particularly useful if your x axis contains text. In the example below we will use the type1 variable as our x axis to see the difference. When we don't flip the coordinates, the x axis is hard to read.

```{r}
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar()

ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip()
```

To change our colours we use a `scale_fill_` function. This is very similar to what we did with scatter plots except we are using fill this time, rather than colour.

```{r}
# change fill with RColorBrewer
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_brewer(palette = "Set1")

# change fill with manual palette
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)
```

Currently our plots have the default ggplot theme which has a grey background. We can change this by setting a new theme. To do so you use `theme_` and select a theme which works best.

```{r}
# change theme to black and white
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal) +
  theme_bw()

# change theme to dark
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal) +
  theme_dark()
```

Adding a theme to each plot can be tiring, so instead you can set a theme for all your plots by using the `theme_set()` function. Usually you set the theme before you make any of your visualisations. Now we have changed the theme to black and white, all our plots from now on will have a black and white theme.

```{r}
# set global theme
theme_set(theme_bw())

# see result
ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)
```

It is often useful and helpful to arrange the values by their rank or size. There are options to do this with base R, but the `forcats` library from the tidyverse makes arranging and ordering factors very straightforward.

We will use the `fct_infreq()` function, which means factors in frequency, in effect ordering our factors by the frequency they appear. There are two approaches. First we use the `fct_infreq()` function within ggplot, or second we arrange our factor outside ggplot. Outside of ggplot is usually better as you have more control and it makes your ggplot code easier to read.

```{r}
# load forcats
library(forcats)

# arrange by frequency within ggplot
ggplot(pokemon, aes(x = fct_infreq(type1), fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)

# arrange by frequency outside ggplot
pokemon$type1 <- fct_infreq(pokemon$type1)

ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)
```

We can also reverse the ordering by putting our `fct_infreq()` function inside a `fct_rev()` function (which stands for factor reverse).

```{r}
# arrange by frequency (descending)
pokemon$type1 <- fct_rev(fct_infreq(pokemon$type1))

ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal)
```
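
To check what these functions are doing to the factor itself, you can look at its levels. The sketch below uses a small made-up factor (`types` is hypothetical) in which Water appears three times, Fire twice, and Grass once.

```r
library(forcats)

# hypothetical types: Water appears 3 times, Fire twice, Grass once
types <- factor(c("Water", "Fire", "Water", "Grass", "Fire", "Water"))

# alphabetical by default
levels(types)

# most frequent first
levels(fct_infreq(types))

# least frequent first
levels(fct_rev(fct_infreq(types)))
```

When the coordinates are flipped, the first level is drawn at the bottom of the plot, which is why reversing the order puts the most frequent type at the top.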

More information on the forcats package can be found here: <https://forcats.tidyverse.org/index.html>

Finally, let's save and label our example bar plot.

```{r}
# save and label
count_type1 <- ggplot(pokemon, aes(x = type1, fill = legendary)) +
  geom_bar() + 
  coord_flip() +
  scale_fill_manual(values = manual_pal) +
  labs(title = "Frequency of each Pokemon type",
       subtitle = "Coloured by if legendary or not",
       y = "Frequency of Pokemon type",
       x = "Type of Pokemon",
       fill = "Legendary pokemon?")

count_type1

ggsave("count_type1.PNG", count_type1)
```

## Bar plots with counts exercise

Using the examples above, make a visualisation of the frequency of ski jump medal winners per country (team) from the Olympics dataset.

Try to include:

-   Setting a new theme using `theme_set()`.
-   Order the x axis by the frequency in reverse order. *hint: remember the forcats package*
-   Make medals a factor, re-order them, and then colour them like we did in the last exercise.
-   Decide if position stack, dodge or fill work best with this visualisation.
-   Add a title and labels.
-   Save your visualisation.

```{r}
# your code here


```

# Bar plots with other statistics

A very useful function of bar plots is to show a group average instead of frequency. There are two approaches to showing a group average in a bar plot.

The first route is to aggregate your dataset, then add it into your bar plot as shown in the example below. We first use `group_by()` and `summarise()` from dplyr to find an average, in this case the average total statistics by pokemon generation.

We then put this data into ggplot. The difference from a normal bar plot is we provide a y axis (our calculated average), and add `stat = "identity"` to the `geom_bar()` function.

This is a great approach as it is easy to see what is happening at each step, making it simple to identify issues and make changes if needed.

```{r}
# group and summarise to make average
avg_total_gen <- pokemon %>%
  group_by(generation) %>%
  summarise(avg_total = mean(total, na.rm = TRUE))

# print result
avg_total_gen

# add to bar plot with stat identity
ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity")
```
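
ggplot2 also provides `geom_col()`, which draws bars using the y values as given, so it should give the same result as `geom_bar(stat = "identity")`. A minimal sketch, using a small made-up summary table (`toy_avg` and its numbers are invented for illustration and are not results from the Pokémon data):

```{r}
library(ggplot2)

# made-up summary table standing in for an aggregated dataset
toy_avg <- data.frame(
  generation = factor(1:3),
  avg_total = c(410, 420, 405)
)

# geom_col() plots the y values directly, no stat argument needed
p_col <- ggplot(toy_avg, aes(x = generation, y = avg_total)) +
  geom_col()

p_col
```

Which of the two you use is a matter of taste; `geom_col()` is simply less typing.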

The other approach is to use the `stat_summary()` function to make the same plot. The difference from a normal bar plot is that we again provide a y axis, but this time it is the raw variable we want to aggregate, total in this case. We then call `stat_summary()` and add two arguments: the function we want to use and the type of geometry to draw; we've used mean and bar.

While this is less code, which is a good thing, it is harder to see the steps taken to make the summary.

```{r}
ggplot(pokemon, aes(x = generation, y = total)) +
  stat_summary(fun = "mean", geom = "bar")
```

We can also add error bars to our plots to help us understand how precise our average measure is. To add error bars it is generally easier to use the group_by and summarise approach. We will look at two types of error bars, the standard deviation and the standard error of the mean.

The standard deviation describes how spread out the individual values are around the group average. The standard error of the mean estimates how far the sample mean is likely to be from the true (population) mean, so it tells you how precise the average is. It is calculated as the standard deviation divided by the square root of the number of observations in the group.

To calculate these, we do the same aggregation as before, but add the standard deviation (`sd`) and the number of observations (`n`) per group to the `summarise()` function, then calculate the sem (standard error of the mean) in a `mutate()` function.

```{r}
# group and summarise to make average, sd and group size per group
avg_total_gen <- pokemon %>%
  group_by(generation) %>%
  summarise(avg_total = mean(total, na.rm = TRUE),
            sd = sd(total, na.rm = TRUE),
            n = sum(!is.na(total))) %>%
  # sem = sd divided by the square root of the group size
  mutate(sem = sd / sqrt(n))

# print result
avg_total_gen
```
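
As a quick check of the formula, the standard error of the mean is the standard deviation of a group divided by the square root of the number of observations in that group. The sketch below works this through on a small made-up data frame (`toy_scores` is hypothetical), so the group sizes are easy to see.

```{r}
library(dplyr)

# made-up data: group "a" has 4 values, group "b" has 9 values
toy_scores <- data.frame(
  group = rep(c("a", "b"), times = c(4, 9)),
  value = c(2, 4, 4, 6, 1:9)
)

toy_summary <- toy_scores %>%
  group_by(group) %>%
  summarise(avg = mean(value),
            sd = sd(value),
            n = n()) %>%
  # each group's sd is divided by the square root of its own size
  mutate(sem = sd / sqrt(n))

toy_summary
```

Because group "b" has more observations, its sd is divided by a larger number (3 rather than 2), which is how a larger sample leads to a more precise average.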

To add error bars we use the `geom_errorbar()` function, which requires two aesthetics within an `aes()` function, `ymin` and `ymax`. To find `ymin` and `ymax` we subtract the sd or sem from, or add it to, our avg_total (the y axis value).

```{r}
# adding standard deviation error bars
ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = avg_total-sd, ymax = avg_total+sd)) +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard deviation")

# adding standard error bars
ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = avg_total-sem, ymax = avg_total+sem)) +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard error of the mean")
```

You can edit the look of the error bars, such as making them narrower and changing the colour. See the example below on how to do this. We've also changed the colour of the bars too.

```{r}
ggplot(avg_total_gen, aes(x = generation, y = avg_total)) +
  geom_bar(stat = "identity", fill = "orange") +
  geom_errorbar(aes(ymin = avg_total-sem, ymax = avg_total+sem), width = 0.3, colour = "darkblue") +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard error of the mean")
```
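
If you prefer the `stat_summary()` route, ggplot2 includes a helper called `mean_se()` that returns the mean plus and minus one standard error, which can be passed to the `fun.data` argument. This is only a sketch of that alternative; it uses the `mpg` dataset that ships with ggplot2 instead of the Pokémon data, and the `width` value is an arbitrary choice.

```{r}
library(ggplot2)

p_se <- ggplot(mpg, aes(x = class, y = hwy)) +
  # bars show the mean of hwy for each class
  stat_summary(fun = mean, geom = "bar", fill = "orange") +
  # error bars show mean +/- one standard error, computed by mean_se()
  stat_summary(fun.data = mean_se, geom = "errorbar",
               width = 0.3, colour = "darkblue")

p_se
```

The trade-off is the same as before: less code, but the calculation is hidden inside the plotting call.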

If you want to add error bars to bar plots with different groupings on the x axis, we need to make a few subtle changes. The main change is that we need to have a dodge bar chart.

First we will re-run our avg_total_gen aggregation and add another column to our group_by. We then pre-define the dodge position so the bars and error bars are shifted by the same amount. Instead of using `position = "dodge"` we use the dodge variable we just made, and add the fill to be legendary (our second grouping).

```{r}
# group by legendary as well
avg_total_gen <- pokemon %>%
  group_by(generation, legendary) %>%
  summarise(avg_total = mean(total, na.rm = TRUE),
            sd = sd(total, na.rm = TRUE),
            n = sum(!is.na(total)),
            .groups = "drop") %>%
  # sem = sd divided by the square root of the group size
  mutate(sem = sd / sqrt(n))

# pre-define the dodge position
dodge <- position_dodge(width = 0.8)

ggplot(avg_total_gen, aes(x = generation, y = avg_total, fill = legendary)) +
  geom_bar(stat = "identity", position = dodge) +
  geom_errorbar(aes(ymin = avg_total-sem, ymax = avg_total+sem), position = dodge, width = 0.3) +
  labs(title = "Average Pokemon total statistics by generation",
       subtitle = "Error bars indicate standard error of the mean") +
  scale_fill_manual(values = manual_pal)
```

## Bar plots with other statistics exercise

Using the examples above and the Olympics dataset, make a visualisation of the average age (mean or median) of GBR (Great Britain) medal winners by medal type and gender, making sure to

-   show the difference between male and female athletes using colours
-   show error bars for either standard deviation or standard error of the mean
-   colour, label and save your visualisation

*hint: don't forget to use `dodge <- position_dodge(width = 0.8)`*

```{r}
# your code here

```

# Beyond bar plots

Bar plots are not the only option to view aggregated data, and there are some sources that suggest bar plots are less than ideal for any visualisation other than showing the frequency of a categorical variable. See <https://paulvanderlaken.com/2018/12/17/avoid-bar-plots-for-continuous-data-do-this-instead/> for details on this.

Fortunately, there are alternatives, such as box plots which will be covered in the second data visualisation workshop, or we can use scatter plots! Scatter plots allow us to see all the data and we can add on an average, the best of both worlds.

To recreate with scatter plots what we just did with bar plots, we can either use both `geom_point()` and `stat_summary()`, or make a summary table and add that using a second `geom_point()` function. First, let's just plot the data as a scatter plot, making the points larger and more transparent. Lowering the alpha value (which makes points more transparent) is important in these plots, as overlapping points build up into darker areas that indicate a higher density of data points.

```{r}
ggplot(pokemon, aes(x = generation, y = total)) +
  geom_point(size = 5, alpha = .33)
```

Now we can add the `stat_summary()` function. We are going to use the mean, the geom is point, and the shape is the `_` character (shape number 95), which is drawn as a short horizontal line; we will also make the shape larger so we can see it more easily.

```{r}
# using stat_summary
ggplot(pokemon, aes(x = generation, y = total)) +
  geom_point(size = 5, alpha = .33) +
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20)
```

If we use the summary table option we first make a summary table with `group_by()` and `summarise()`. Then we add two `geom_point()` functions. The first has the pokemon data and our x and y axis. The second is our summary table, with the same x axis and the `avg_total` as the y axis. 
```{r}
# summary table option
gen_avg_total <- pokemon %>%
  group_by(generation) %>%
  summarise(avg_total = mean(total, na.rm = TRUE))

gen_avg_total

ggplot() +
  geom_point(data = pokemon,
             aes(x = generation, y = total),
             size = 5, alpha = .33) +
  geom_point(data = gen_avg_total,
             aes(x = generation, y = avg_total),
             shape = 95, size = 20)

```

Either option works well, but for the rest of the examples we will use the `stat_summary()` option as it is less code. 

Now we have all our data so we can see the number of points for each group, and we can see the average per group!

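Finally, we can add colour by our grouped variable (legendary) and change the colour palette. Just like with the bar plots we can change the positioning to dodge. By default points are not stacked; both groups are drawn on the same vertical line for each generation (position identity). The examples below show both the default and dodge versions.
```{r}
# default position (identity)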
ggplot(pokemon, aes(x = generation, y = total, colour = legendary)) +
  geom_point(size = 5, alpha = 0.3) + 
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20) +
  scale_colour_brewer(palette = "Set1")

# position dodge
dodge <- position_dodge(width = 0.8)

ggplot(pokemon, aes(x = generation, y = total, colour = legendary)) +
  geom_point(size = 5, alpha = 0.3, position = dodge) + 
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20,
               position = dodge) +
  scale_colour_brewer(palette = "Set1")
```

## Beyond bar plots exercise

Recreate your last visualisation, average age (mean or median) of GBR (Great Britain) medal winners by medal type and gender, using the `geom_point()` and `stat_summary()` method detailed above.

```{r}
# your code here

```

# Individual coding challenge

For the individual coding challenge we will be using the food consumption data from tidy Tuesday: <https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-18/readme.md>.

Use what we have covered in this workshop to make two visualisations of this dataset:

-   A scatter plot showing consumption and co2 emissions for a selected country (e.g. UK or France)
-   A bar plot of average co2 emissions per food category. Display just six countries to compare, such as UK, France, Germany etc. and colour them.

Use some of the tips we used and showed to make the visualisations have labels, colours and look appealing. Try and have some fun with it! =)

```{r}
food_consumption <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv')

food_consumption %>%
  glimpse()

# your code here

```

------------------------------------------------------------------------

# Understanding which visualisation to use and when

Sometimes it can be hard to know where to start with a visualisation. A great first starting point is understanding the options depending on the data types you have available. This website gives lots of information and visual guides on this process: <https://www.data-to-viz.com/>

# Seeing what others have done with this data

The Olympics data we used for the exercises today is from the Tidy Tuesday GitHub repository. Tidy Tuesday is a social data visualisation challenge that happens every week and is a great way of learning about data viz.

Use the link below to see what others have done and posted about using the Olympics data. Use it to get some ideas on what else you can try, or to get some inspiration from others. <https://twitter.com/search?lang=en&q=%23tidytuesday%20olympics&src=typed_query>

# Fun extra

As a fun extra you can manually determine shapes in your visualisation using `scale_shape_manual()`. We've also removed the guide which was unnecessary by using `guide = "none"`. 

In the example below, as our x axis is generation from 1 to 8, we can make generation 1 have a shape of the number 1 and so on. 
```{r}
ggplot(pokemon, aes(x = generation, y = total, shape = generation)) +
  geom_point(size = 5, alpha = .33) +
  stat_summary(fun = mean, geom = "point",
               shape = 95, size = 20) +
  scale_shape_manual(values = c(49:56),
                     guide = "none")
```
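
Shape numbers from 32 to 127 are drawn as the matching ASCII characters, which is why 49 to 56 give the digits 1 to 8. As a small sketch, base R's `intToUtf8()` can be used to look up which character a given number corresponds to:

```{r}
# characters for shape numbers 49 to 56
intToUtf8(49:56, multiple = TRUE)

# character for shape number 95, used for the average markers
intToUtf8(95)
```

The first call returns the digits "1" to "8" and the second returns "_", the underscore that is drawn as a horizontal line.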

------------------------------------------------------------------------


# R Data Visualisation 2 - Objective of workshop

To create histograms, box, and time series plots using the ggplot2 package. 

# What this workshop will cover

In this workshop, the aim is to cover how to work with dates in plots, and use histograms and box plots. We will be covering: 

-   How to make box plots with ggplot2
-   Displaying distributions with histograms
-   Working with dates with the lubridate package
-   How to make time series line plots
-   How to split your data into facet grids

------------------------------------------------------------------------

In this data visualisation workshop we will be building on the concepts learnt in the first workshop, constructing visualisations using the `ggplot2` library. 

![](https://github.com/andrewmoles2/rTrainIntroduction/blob/main/r-data-visualisation-1/images/ggplot2_masterpiece.png?raw=true){width="541"}

We will be using one new package called *lubridate*, a tidyverse package which is designed to make working with dates and times easier; this will help us in making time series visualisations. **Run the code below to install lubridate**. 

```{r eval=FALSE}
# install lubridate
install.packages("lubridate")
```

Before we start we will need to load the libraries we will be using during this session. **Run the code below to load your libraries**. 

```{r message=FALSE, warning=FALSE}
# libraries we will be using
library(ggplot2)
library(dplyr)
library(lubridate)
library(readr)
library(janitor)
library(RColorBrewer)
```

# Box plots

Box plots are designed to compare a continuous (numeric) variable across the groups of a categorical variable (samples or groups). They do this by displaying summary statistics of the continuous variable for each category. 

The summary statistics shown are: 

- The median (middle value), also known as the second quartile or Q2
- First quartile, known as Q1, which is the 25th percentile
- Third quartile, known as Q3, which is the 75th percentile
- Interquartile range, known as IQR, which is the distance between Q1 and Q3 and covers the middle 50% of the data
- "minimum" value, the smallest data point that is no lower than `Q1 - 1.5*IQR`
- "maximum" value, the largest data point that is no higher than `Q3 + 1.5*IQR`
- Outliers, which are values that fall outside of the "maximum" or "minimum" values
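
These statistics can be worked out by hand with base R. The sketch below uses a small made-up vector (`toy_values` is hypothetical) and the default `quantile()` method, so treat it as an illustration of the definitions rather than an exact copy of what ggplot2 does internally.

```{r}
# made-up values with one obvious outlier
toy_values <- c(2, 4, 5, 6, 7, 8, 9, 10, 30)

q1 <- quantile(toy_values, 0.25, names = FALSE)   # 5
q3 <- quantile(toy_values, 0.75, names = FALSE)   # 9
iqr <- IQR(toy_values)                            # 4

# limits used to decide what counts as an outlier
lower_limit <- q1 - 1.5 * iqr                     # -1
upper_limit <- q3 + 1.5 * iqr                     # 15

# whiskers end at the most extreme values inside the limits
lower_whisker <- min(toy_values[toy_values >= lower_limit])   # 2
upper_whisker <- max(toy_values[toy_values <= upper_limit])   # 10

# anything outside the limits is an outlier
outliers <- toy_values[toy_values < lower_limit | toy_values > upper_limit]
outliers                                          # 30
```

Note that the whiskers stop at real data points (2 and 10 here), not at the calculated limits themselves.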

We will use data from the Pokémon games again for our examples for box plots, which was web scraped from <https://pokemondb.net/pokedex/all>.

```{r message=FALSE}
# load and clean names
pokemon <- read_csv("https://raw.githubusercontent.com/andrewmoles2/webScraping/main/R/data/pokemon.csv") %>%
  clean_names()
# review data
pokemon %>%
  glimpse()
```

For these examples we will just look at one type of Pokémon, the electric type; the most famous of which is Pikachu! First, we extract just the electric type Pokémon, and make relevant columns factors. 

```{r}
# select columns to convert to factor
to_factor <- c("type1", "type2", "generation")

# extract just electric pokemon and make cols factors
electric_pokemon <- pokemon %>%
  filter(type1 == "Electric" | type2 == "Electric") %>%
  mutate(across(all_of(to_factor), factor))

head(electric_pokemon)
```

To make a box plot in ggplot we use the `geom_boxplot()` geom function. One of our axis variables has to be categorical and the other has to be numeric. In the below example we will use generation (categorical) and total (numeric). 

```{r}
# generation by total
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot()
```

From the output we see a few things. First, each box has a line through the middle which indicates the median; the box itself is our interquartile range. The lines above and below the boxes (known as whiskers) extend to the "maximum" and "minimum" values, the most extreme data points within 1.5 times the IQR of the box. The black dots indicate outliers, which fall outside those values. 

Just like with scatter and bar plots we can change the colours! You can use either fill or colour arguments with box plots, but fill tends to look better. 

We will use the colour of Pikachu to colour our boxes. We used the pokemon colour picker to get the colour of pikachu: <https://pokepalettes.com/#pikachu>

```{r}
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652")
```

Sometimes it is useful to remove the outliers. To do so you add in the `outlier.shape = NA` argument. 

```{r}
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA)
```

Displaying outliers is usually a good idea so we will keep them for now, and change their colour and shape. To adjust these we use the `outlier.colour` and `outlier.shape` arguments. We've used the colour of Pikachu's cheeks as the outlier colour and made the shape square. 
```{r}
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.colour = "#c52018",
               outlier.shape = 15)
```

## Box plots exercise 

For the exercises in this workshop we will be using daily COVID data that is collected from most of the countries around the world. 

The COVID data is from Our World in Data, and is stored in a GitHub repository. More information on the data and what each variable means can be found here: <https://github.com/owid/covid-19-data/tree/master/public/data>

```{r}
# load in covid data and select cases, deaths and vaccines
covid <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv") %>%
  select(iso_code:new_deaths_smoothed_per_million, contains("vaccin"),
         population, median_age, gdp_per_capita)

# have a quick look at the data
covid %>% glimpse()
```

For this exercise we will make two box plots from our data, looking more at the demographics of each continent (we will look at cases and vaccines later).

Your two box plots should show the following:

- The median age of each continent
- The gdp per capita for each continent
- Make sure to change the colour of the boxes and outliers to make it look better! 
- Try changing the shape and size of your outlier

Hint: you will have to remove the `NA` values from continent before plotting, e.g. `covid %>% filter(!is.na(continent))`

Hint: You can pipe from your filter function straight into ggplot2!

Hint: You can add colours in lots of ways but it can be fun to use a colour picker <http://tristen.ca/hcl-picker/#/hlc/11/1.1/DC7261/D77357>. 

```{r}
# your code here


```


# Improving your box plots

The main issue with box plots, in a similar way to bar plots, is they can hide data. We can fix this by adding a scatter plot over the top of the boxes so we can see the full distribution of the data. 

When adding in a scatter plot, we won't need our outliers as the scatter plot will show these for us. We will need to remove them using the `outlier.shape = NA` argument. 

```{r}
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_point()
```

Some of our data points are overlapping, which makes it a little hard to see all the data. We can fix this by changing the position of our points using the `position = "jitter"` argument, which adds a small amount of random movement to each point. We can also use `geom_jitter()`, which is shorthand for `geom_point(position = "jitter")`; we will use `geom_jitter()` going forward as it is less typing. 

```{r}
# change position in geom_point
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_point(position = "jitter")

# using geom_jitter
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter()
```
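
By default the jitter moves points a little in both directions, which slightly changes the plotted y values. Below is a hedged sketch of how to limit the jitter to the horizontal direction using `position_jitter()`. It uses the `mpg` dataset that ships with ggplot2 rather than the Pokémon data, and the `width`, `alpha` and `seed` values are arbitrary choices.

```{r}
library(ggplot2)

p_jitter <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +
  # height = 0 keeps the true y values; seed makes the jitter repeatable
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 123),
             alpha = 0.5)

p_jitter
```

Setting a seed means the points land in the same place every time the plot is drawn, which is useful when saving figures.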

We can also add in a colour grouping to our points to make them more meaningful. We add the colour aesthetic to our `geom_jitter` function. In the example we are colouring our points by if a pokemon is legendary or not. 

```{r}
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary))
```

Finally we can change the colours of our points, which in this case we have done manually. Again, the colours were taken from the pokemon colour picker of pikachu: <https://pokepalettes.com/#pikachu>.

```{r}
ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary)) +
  scale_colour_manual(values = c("#c52018", "#41414a"))
```

Now we can add a title and save the plot! When saving the plot we have manually adjusted the width of the plot. You can also change the height. 

```{r}
electric_pokemon_box <- ggplot(electric_pokemon, aes(x = generation, y = total)) +
  geom_boxplot(fill = "#f6e652", outlier.shape = NA) +
  geom_jitter(aes(colour = legendary)) +
  scale_colour_manual(values = c("#c52018", "#41414a")) +
  labs(title = "Summary of electric pokemon for each generation") +
  theme_bw()

electric_pokemon_box

ggsave("electric_pokemon_box.png", electric_pokemon_box,
       width = 5.5)
```


## Improving your box plots exercise

For this exercise we will look at vaccines! We will look at 10 countries to see the difference in vaccine distribution; 5 have low gdp and 5 have high gdp. The data will be pre-prepared for you. We have made a vector with the countries that have high and low gdp. Then we have filtered our covid data by this vector, and made the location a factor with its levels in the same order as the vector.

1) Make a box plot using the *covid_select_countries* data, with x = location and y = total_vaccinations_per_hundred. Be sure to include `geom_jitter()`.
2) Now improve the look of your box plot! Change the colour of the boxes and the points, make the points more transparent, remove the outliers, change the theme, and flip the co-ordinates. 
3) Make another box plot the same way but use the people_fully_vaccinated variable as your y axis. 
4) Give both your box plots a title and change the axis labels (if you want).
5) Save your plots using `ggsave()`. You will need to assign the plots to a variable first. 

```{r}
# Make vector with low and high gdp countries
high_low_gdp <- c("Sierra Leone", "Ethiopia","Yemen", 
                  "Zambia", "Nepal", "Sweden", "Australia",
                  "Saudi Arabia", "Germany", "United Kingdom")

# Only include locations in high_low_gdp
# Make location a factor, ordered by high_low_gdp
covid_select_countries <- covid %>%
  filter(location %in% high_low_gdp) %>%
  mutate(location = factor(location, levels = high_low_gdp))

# your code here


```

# Displaying distributions with histograms

Histograms are great for visualising the distribution of numeric data. Histograms have one numerical variable as their input. 

To make a histogram with ggplot we provide a numerical value to our x axis, and use the `geom_histogram()` geom. In the example we are using all the pokemon data and showing the distribution of the total column. 
```{r}
ggplot(pokemon, aes(x = total)) +
  geom_histogram()
```

We can adjust the size of the *bins* of our plot with two methods: changing the binwidth or selecting the number of bins. A bin is the interval of x axis values covered by each bar; the wider the bin, the more data is included in each bar. 

The first example uses `binwidth`. The number you provide is in the units of your x axis. In our example we are using the total column which goes up to 754. If we have `binwidth = 8`, then each bin covers a range of 8 on the x axis (8 points of total). Run the two examples below with a smaller and larger binwidth to see the results. 
```{r}
# summary stats for total column
summary(pokemon$total)

# binwidth of 8
ggplot(pokemon, aes(x = total)) +
  geom_histogram(binwidth = 8) +
  labs(title = "Small binwidth (8)")

# binwidth of 50
ggplot(pokemon, aes(x = total)) +
  geom_histogram(binwidth = 50) +
  labs(title = "Larger binwidth (50)")
```
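
To check what ggplot2 has done with a given `binwidth`, you can pull out the data behind the bars with `layer_data()`. A minimal sketch, using the built-in `faithful` dataset rather than the Pokémon data so it runs on its own:

```{r}
library(ggplot2)

p_bins <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 8)

# one row per bar: where each bin starts and ends, and how many rows fall in it
bin_table <- layer_data(p_bins)[, c("xmin", "xmax", "count")]
bin_table

# every bin spans 8 units of the x axis
unique(round(bin_table$xmax - bin_table$xmin, 6))

# the counts add up to the number of rows in the data
sum(bin_table$count) == nrow(faithful)
```

This shows that `binwidth` sets the range of x values in each bin, while the number of data points in each bin (the count) varies from bar to bar.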

The other method is to select the number of bins to use, using the `bins` argument. The more bins we use, the less data will be contained in each bin. In the example below we have bins with lots of data (`bins = 10`) and bins with less data (`bins = 50`). Which do you think is best?

```{r}
# using 10 bins
ggplot(pokemon, aes(x = total)) +
  geom_histogram(bins = 10) +
  labs(title = "Less bins = more data in each bin")

# using 50 bins
ggplot(pokemon, aes(x = total)) +
  geom_histogram(bins = 50) +
  labs(title = "More bins = less data in each bin")
```

It can be helpful to colour your histogram by a categorical variable. This works the same way as with bar plots, mapping the variable to the `fill` aesthetic. In the example we have filled our histogram by the legendary category. 

```{r}
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20)
```

Another useful method is to use *facets*, which split up your data by a categorical variable and present each group in its own panel, arranged in a grid. 

There are two techniques in ggplot to make facets, using `facet_grid()` or `facet_wrap()`. To use `facet_grid()` we define if we want to display our data row-wise (`rows = `) or column-wise (`cols = `). When defining which column to split our data by we need to use the `vars()` function. See the two examples below on how to do a row or column facet grid. 

```{r}
# row-wise display
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_grid(rows = vars(legendary)) +
  labs(title = "Row-wise facet grid")

# column-wise display
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_grid(cols = vars(legendary)) +
  labs(title = "column-wise facet grid")
```
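
The `rows` and `cols` arguments can also be used together to split the data by two variables at once. A small sketch of this, using the `mpg` dataset that ships with ggplot2 (the choice of `drv` and `cyl` as the splitting columns is just for illustration):

```{r}
library(ggplot2)

p_grid <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  # one row of panels per drive type, one column per number of cylinders
  facet_grid(rows = vars(drv), cols = vars(cyl)) +
  labs(title = "Facet grid with rows and columns")

p_grid
```

Combinations that have no data are still given a panel, which appears empty.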

The other option is `facet_wrap()`, which by default only needs the column you want to split your data by. It does allow extra specification with the `nrow` and `ncol` arguments, allowing you to define how many rows and columns to display. 

In the examples below we show the default `facet_wrap`, and how to adjust the column or row specification. We have used the generation column as it has more groups.  
```{r}
# default facet_wrap
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation)) +
  labs(title = "Default facet wrap")

# 4 rows
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation),
             nrow = 4) +
  labs(title = "Facet wrap with 4 rows")

# 4 columns
ggplot(pokemon, aes(x = total, fill = legendary)) +
  geom_histogram(binwidth = 20) +
  facet_wrap(vars(generation),
             ncol = 4) +
  labs(title = "Facet wrap with 4 columns")
```


## Displaying distributions exercise

For this exercise we will be making a histogram using the people_fully_vaccinated_per_hundred column for each continent.

- Make a histogram with people_fully_vaccinated_per_hundred as your x axis
- Add a fill argument with continent
- Adjust the `binwidth` or `bins` (e.g. `binwidth = 5` looks good)
- Using RColorBrewer, adjust the colours used in fill

Hint: you will have to remove the `NA` values from continent before plotting, e.g. `covid %>% filter(!is.na(continent))`

Hint: You can pipe from your filter function straight into ggplot2!

Hint: To change the fill colours you can use `scale_fill_brewer(palette = "a palette")`

Hint: Use `brewer.pal.info` to find RColorBrewer palettes

```{r}
# your code here

```


# Working with the date data type with lubridate

Working with the date data type when programming can be a bit tricky for many reasons. There are different formats, time zones, and the challenge of extracting information from the date. Fortunately, the `lubridate` package comes to the rescue! 

There are three types of date data type: date (2010-09-01), time (15:08:52 BST), date-time (2010-09-01 15:08:52 BST). For this workshop we will be focusing on the date type as it is the most common. 

You can find out today's date (more useful than it sounds) or the date and time using the `today()` or `now()` functions. 
```{r}
# make sure dplyr and lubridate are loaded
library(dplyr)
library(lubridate)

# get today's date
today()
# today's date and time
now()

# make today's date a variable
today_date <- today()
```

A great feature of lubridate is extracting the year, month, day, or week day information from your date. We can test it out on today's date. Run the code to see the output. 

```{r}
# year
year(today_date)
# month
month(today_date)
month(today_date, label = TRUE)
# week
week(today_date)
# day
day(today_date)
# weekday
wday(today_date)
wday(today_date, label = TRUE)
```

Notice that for the `month` and `wday` functions we have the option to add labels. This can be very useful, making your month or week day outputs more readable. 
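
lubridate also helps with the different date formats mentioned earlier. Its parsing functions are named after the order of the parts in the text: `ymd()` for year-month-day, `dmy()` for day-month-year, and `mdy()` for month-day-year. A short sketch with made-up date strings:

```{r}
library(lubridate)

# the same date written in three different formats
date_ymd <- ymd("2010-09-01")
date_dmy <- dmy("01/09/2010")
date_mdy <- mdy("09/01/2010")

# all three are converted to the same Date value
date_ymd
date_dmy == date_ymd
date_mdy == date_ymd

# the result is a Date, not text
class(date_ymd)
```

Once text has been converted to a Date in this way, functions such as `year()`, `month()` and `wday()` can be used on it.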

For the rest of the examples we will use some randomised made up data containing daily sleep, and step information. Run the code below to see the data. 

*note: to make this data we have used the randomisation function `rnorm`, so your values will differ each time you run the code. Related functions such as `sample` and `runif` are worth looking up if you are interested in how they work*
```{r}
# make some random data
df <- data.frame(
  date = seq(as.Date("2019-01-01"), as.Date("2021-12-01"), by = "days"),
  hours_sleep = round(rnorm(1066, mean = 9, sd = 1.5)),
  steps = round(rnorm(1066, mean = 8000, sd = 2000))
)

head(df)
```

We can now use the `mutate` function to make a year, month, week, day, and week day column. 

```{r}
df <- df %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         week = week(date),
         day = day(date),
         week_day = wday(date, label = TRUE))

head(df)

# see the breakdown of the date
df[1:2, c("date", "year", "month", "week", "day", "week_day")]
```

Breaking the date down in this way allows us to do some aggregation of our data by the year, month, week, day, or weekday! In the examples below we have shown year and weekday.

```{r}
# aggregate by year
df %>%
  group_by(year) %>%
  summarise(avg_sleep = mean(hours_sleep),
            avg_steps = mean(steps),
            total_steps = sum(steps))

# aggregate by week day
df %>%
  group_by(week_day) %>%
  summarise(avg_sleep = mean(hours_sleep),
            avg_steps = mean(steps),
            total_steps = sum(steps))
``` 
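
Another option for aggregating by month is lubridate's `floor_date()` function, which rounds each date down to the start of its month (or week, or year) and keeps the result as a date. The sketch below uses a tiny made-up data frame (`toy_steps` is hypothetical) so the result is easy to check by eye.

```{r}
library(dplyr)
library(lubridate)

# made-up data: two days in January and two days in February
toy_steps <- data.frame(
  date = seq(as.Date("2021-01-30"), as.Date("2021-02-02"), by = "days"),
  steps = c(8000, 9000, 7000, 11000)
)

toy_monthly <- toy_steps %>%
  # every date becomes the first day of its month
  mutate(month_start = floor_date(date, "month")) %>%
  group_by(month_start) %>%
  summarise(avg_steps = mean(steps))

toy_monthly
```

Because `month_start` is still a date, each month and year combination gets its own value, so January 2021 would not be mixed with January 2020.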

There are more functions from the lubridate package that we won't be able to cover in this session, so do have a look at the package website for more information - <https://lubridate.tidyverse.org/index.html> - and check out the R for Data Science chapter on dates - <https://r4ds.had.co.nz/dates-and-times.html>. 

## lubridate exercise

Using the examples above, extract year, month, day, day of week from covid data, and do an aggregation! 

1) Add new columns to your covid data for year, month, week, day and week_day. Try to add labels to month and week_day. 
2) Aggregate your covid data by year and month to find the mean total cases per million and mean total deaths per million. 
3) Print out the result. 
```{r}
# your code here

# separate date column
covid <- covid %>%
  mutate(year = year(date),
         month = month(date, label = TRUE),
         week = week(date),
         day = day(date),
         week_day = wday(date, label = TRUE))

# make year and month aggregate
avg_year_month_covid <- covid %>%
  group_by(year, month) %>%
  summarise(
    avg_total_cases_per_mil = mean(total_cases_per_million, na.rm = TRUE),
    avg_total_deaths_per_mil = mean(total_deaths_per_million, na.rm = TRUE)
    )

avg_year_month_covid
```


# Time series plots

Time series plots visualise data over a period of time, which can be hourly, daily, weekly, monthly, or yearly. They are a great way to view trends over time. When plotting a time series, the x axis is the date and the y axis is your measure.

The simplest form of time series visualisation in R uses an unedited date variable. Using our example data (`df`) we will visualise how steps have changed each day.
```{r}
# daily time series
df %>%
  ggplot(aes(x = date, y = steps)) +
  geom_line()
```

As we can see, the number of steps taken each day is pretty variable, as you might expect. There is a lot of data here so it is hard to see any real patterns; it just looks like noise! To solve this we can aggregate our data by the year, the month, or the week to see if we can get any more insights.

For our example data it might be interesting to see how many steps are taken on average each month, and to compare this year on year.

We first aggregate our data, grouping by the month and year columns we made with the lubridate package, find the average steps, and convert the year column into a factor to make plotting easier; month is already a factor. 
```{r}
# aggregated time series by month
monthly_steps <- df %>%
  group_by(month, year) %>%
  summarise(avg_steps = mean(steps)) %>%
  mutate(year = factor(year))

monthly_steps
```

Now we can make a time series by month! It is often helpful when using `geom_line()` to pair it with `geom_point()` so we can see each data point clearly as well as the trends shown by the lines.

```{r}
ggplot(monthly_steps,
       aes(x = month, y = avg_steps)) +
  geom_line() +
  geom_point()
```

That didn't work as expected! Because month is a factor, ggplot treats each month as its own group, so the line joins the points within each month rather than across months. We need to use the `group = ` argument to tell ggplot which points to connect.

By adding `group = year`, each year gets its own line and our plot will now look like a time series. Run the code to check it out.

```{r}
ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year)) +
  geom_line() +
  geom_point()
```
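
To see what `group` changes behind the scenes, you can ask ggplot how many groups it made for a layer. The sketch below uses made-up data (`toy_steps` and its values are invented for illustration) and `layer_data()`, which returns the data ggplot computed for a layer.

```{r}
library(ggplot2)

# made-up monthly averages for two years, only for illustration
toy_steps <- data.frame(
  month = factor(rep(c("Jan", "Feb", "Mar"), times = 2),
                 levels = c("Jan", "Feb", "Mar")),
  year = factor(rep(c(2020, 2021), each = 3)),
  avg_steps = c(8000, 8200, 8100, 7600, 7900, 8300)
)

# without group: ggplot makes one group per month
no_group <- ggplot(toy_steps, aes(x = month, y = avg_steps)) +
  geom_line()

# with group = year: one group (and so one line) per year
by_year <- ggplot(toy_steps, aes(x = month, y = avg_steps, group = year)) +
  geom_line()

# count the groups ggplot created for each plot
length(unique(layer_data(no_group)$group))
length(unique(layer_data(by_year)$group))
```

With this toy data the first plot should have one group for each of the three months, and the second should have one group for each of the two years.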

It would also be helpful to see what year each line represents. We add the `colour = year` argument in as well to show this. 

```{r}
ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  geom_line() +
  geom_point()
```

Our plot is still looking a little busy so we can use facets to split our data by year. We've used `facet_wrap` here with 3 rows. 

```{r}
ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(year), nrow = 3)
```

Finally, we can make a few last adjustments and we have a nice visualisation that shows average step count per month for the years 2019 to 2021. Below is a list of all the additions made to change the look of the plot:

- Changed the size of the lines and the points with the `size = ` argument
- Added a title and changed the axis names
- Added a colour scale from the RColorBrewer package
- Changed the theme to dark and changed the font to Avenir
- Adjusted the y axis limits

```{r}
step_count <- ggplot(monthly_steps,
       aes(x = month, y = avg_steps,
           group = year, colour = year)) +
  # note: in ggplot2 3.4.0 and later, `linewidth` replaces `size` for lines
  geom_line(size = 2.5) +
  geom_point(size = 3) +
  facet_wrap(vars(year), nrow = 3) +
  labs(title = "Average step count per month for the years 2019 to 2021",
       x = "Month", y = "Average steps (mean)",
       colour = "Year") +
  scale_colour_brewer(palette = "Pastel2") +
  theme_dark(base_family = "Avenir") +
  scale_y_continuous(limits = c(7000, 9000)) 

step_count

ggsave("step_count.png", step_count, width = 9)
```
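
`ggsave()` above sets only the width, so the height is taken from the current graphics device. If you want the saved file to have the same dimensions every time, one option is to set `height`, `units`, and `dpi` as well. A minimal sketch, using a made-up plot and a temporary file rather than the workshop data:

```{r}
library(ggplot2)

# a small made-up plot, only for illustration
toy_plot <- ggplot(data.frame(x = 1:5, y = c(2, 4, 3, 5, 4)),
                   aes(x = x, y = y)) +
  geom_line() +
  geom_point()

# write to a temporary file so nothing in your project folder is overwritten
out_file <- tempfile(fileext = ".png")

# width and height are in the units given by `units`; dpi sets the resolution
ggsave(out_file, toy_plot, width = 9, height = 6, units = "in", dpi = 300)

# check the file was written
file.exists(out_file)
```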

## Time series plots exercise

For this exercise we will be looking at the vaccine roll out for the United Kingdom, India, Nepal, Israel, Germany, and Australia. Each country has had a slightly different roll out, with Israel being the fastest. We will be looking at the week by week roll out for 2021.

Data preparation: 

1) Make a vector called *sel_country* that includes United Kingdom, India, Nepal, Israel, Germany, and Australia
2) Filter your covid data to include only locations that are in your *sel_country* vector, and filter for the year to be equal to 2021. Assign your filtered data to a variable called *weekly_vax*.
3) Aggregate your *weekly_vax* data by week and location to find the mean of the `people_vaccinated_per_hundred` column. Assign the result back to *weekly_vax*.
4) Make the week and location columns of *weekly_vax* factors
 
Plotting:

Using your *weekly_vax* data you have just prepared:

1) Make a time series plot with week as your x axis and your aggregation of the `people_vaccinated_per_hundred` column as your y axis. 
2) Colour and group your data by location. 
3) Make any aesthetic changes you think will make the plot better based on what we have covered so far, such as adding titles, changing colours, or adding facets (`facet_grid()` or `facet_wrap()`).
4) Assign your final plot to a variable and save it! 

Hint: if your x axis is looking squashed or cramped, try adding in `scale_x_discrete(guide = guide_axis(n.dodge = 2))` 
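
To see what the hint does, here is a minimal sketch on made-up data (the `toy_weeks` data frame is invented for illustration and is not the exercise data). `n.dodge = 2` staggers the x axis labels over two rows so they do not overlap.

```{r}
library(ggplot2)

# made-up values for 30 weeks, only for illustration
toy_weeks <- data.frame(
  week = factor(1:30),
  value = cumsum(rep(2, 30))
)

dodged_axis <- ggplot(toy_weeks, aes(x = week, y = value, group = 1)) +
  geom_line() +
  # stagger the week labels over two rows
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

dodged_axis
```

Try removing the `scale_x_discrete()` line and re-running to compare the two axes.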

```{r}
# your code here


```


# Final task - Please give us your individual feedback!

We would be grateful if you could take a minute before the end of the workshop so we can get your feedback!

<https://lse.eu.qualtrics.com/jfe/form/SV_eflc2yj4pcryc62?coursename=R%20Data%20Visualisation%202:%20Box,%20histogram,%20and%20line%20plots&topic=R&link=https://lsecloud.sharepoint.com/:u:/s/TEAM_APD-DSL-Digital-Skills-Trainers/ERDaMePD5XBKgxuOMtN94YoB4aDZ5dxXqgPXBDdzWFxYSQ&prog=DS&version=21-22>

The solutions will be available from a link at the end of the survey.

# Individual coding challenge

For the coding challenge we will look at other things you can do with ggplot2 such as making artwork! This is known as generative art, which is produced either in part or completely by automated processes. 

Generative art is a complex topic, but some of the ideas and styles can be done using the aRtsy package, <https://koenderks.github.io/aRtsy/>, which makes generative art more accessible. 

First, you will need to install the aRtsy package. 
```{r eval=FALSE}
# install aRtsy
install.packages("aRtsy")
```

Then you will need to load it! 
```{r}
# load aRtsy
library(aRtsy)
```

When making generative art it is a good idea to make it reproducible, as there is a lot of randomisation involved. To do this in R you *set a seed*, which in simple terms means that the same seed always produces the same sequence of random numbers. We use the `set.seed()` function and add in any number. The number is our seed. If we gave someone else our code and our seed they would be able to reproduce our results.
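
As a quick illustration of what a seed does, the sketch below draws random numbers with base R's `runif()` rather than aRtsy. The variable names are made up for this example.

```{r}
# draw three random numbers with the seed set to 1
set.seed(1)
first_draw <- runif(3)

# reset the same seed and draw again
set.seed(1)
second_draw <- runif(3)

# a different seed gives different numbers
set.seed(2)
third_draw <- runif(3)

# same seed, same numbers
identical(first_draw, second_draw)

# different seed, different numbers
identical(first_draw, third_draw)
```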

We've given some examples below on making a striped artwork and flow fields. Run the code chunk below, then try changing the seed to see how the results change when you run it again! 

Note: these will take a few moments to run! 
```{r}
# set the seed to 1
set.seed(1)

# make colour palettes from RColorBrewer
set1 <- brewer.pal(n = 9, name = "Set1")
pastel1 <- brewer.pal(n = 9, name = "Pastel1")
paired <- brewer.pal(n = 12, name = "Paired")

# test out different parameters for stripes
canvas_stripes(paired, n = 800, H = 5, burnin = 5)

canvas_stripes(pastel1, n = 500, H = 15, burnin = 2)

# Test out different parameters for flow fields
canvas_flow(set1, background = "#fafafa", lines = 800, lwd = 0.30,
            iterations = 80, stepmax = 0.15)

pastel_flow <- canvas_flow(pastel1, background = "black", lines = 2000, lwd = 0.15,
            iterations = 30, stepmax = 0.10)

pastel_flow

# save pastel_flow
saveCanvas(pastel_flow, "pastel_flow.png")
```

Have a go yourself at making some generative art in R! Try out the following functions from aRtsy, changing the parameters to adjust the visualisation. 

- `canvas_flow()` <https://koenderks.github.io/aRtsy/reference/canvas_flow.html>
- `canvas_stripes()` <https://koenderks.github.io/aRtsy/reference/canvas_stripes.html>
- `canvas_watercolors()` <https://koenderks.github.io/aRtsy/reference/canvas_watercolors.html>

Don't forget to save any of your artwork you like using the `saveCanvas()` function. 
```{r}
set.seed(1)

# your code here


```


------------------------------------------------------------------------

# Recommended resources to continue your data visualisation learning

The ggplot2 book is an excellent resource with lots of examples and exercises to have a go at <https://ggplot2-book.org/>. 

Cedric Scherer writes blogs and tutorials on ggplot2 on his website. Some of his content is really great and worth looking through. Below are two of his tutorials to get you started:

- <https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/>
- <https://www.cedricscherer.com/2019/05/17/the-evolution-of-a-ggplot-ep.-1/>

Georgios Karamanis is a data visualisation designer and makes some amazing visualisations using R! It's worth browsing his website for inspiration <https://karaman.is/> or following him on Twitter <https://twitter.com/geokaramanis>.

For ideas about what to do with your data have a look at the R graph gallery <https://www.r-graph-gallery.com/>. 


